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Foreword 


If you can't do bioinformatics, you can't do biology, and Perl is the biologist's favorite 
language for doing bioinformatics. The genomics revolution has so altered the landscape of 
biology that almost anyone who works at the bench now spends much of his time at the 
computer as well, browsing through the large online databases of genes, proteins, 
interactions and published papers. For example, the availability of an (almost) complete 
catalog of all the genes in human has fundamentally changed how anyone involved in genetic 
research works. Traditionally, a biologist would spend days thinking out the strategy for 
identifying a gene and months working in the lab cloning and screening to get his hands on it. 
Now he spends days thinking out the appropriate strategy for mining the gene from a 
genome database, seconds executing the query, and another few minutes ordering the 
appropriate clone from the resource center. The availability of genomes from many species 
and phyla makes it possible to apply comparative genomics techniques to the problems of 
identifying functionally significant portions of proteins or finding the genes responsible for a 
species' or strains distinguishing traits. 


Parallel revolutions are occurring in neurobiology, in which new imaging techniques allow 
functional changes in the nervous systems of higher organisms to be observed in situ; in 
clinical research, where the computer database is rapidly replacing the paper chart; and even 
in botany, where herbaria are being digitized and cataloged for online access. 


Biology is undergoing a sea change, evolving into an information-driven science in which the 
acquisition of large-scale data sets followed by pattern recognition and data mining plays just 
as prominent a role as traditional hypothesis testing. The two approaches are 
complementary: the patterns discovered in large-scale data sets suggest hypotheses to test, 
while hypotheses can be tested directly on the data sets stored in online databases. 


To take advantage of the new biology, biologists must be as comfortable with the computer 
as they now are with thermocyclers and electrophoresis units. Web-based access to biological 
databases and the various collections of prepackaged data analysis tools are wonderful, but 
often they are not quite enough. To really make the most of the information revolution in 
biology, biologists must be able to manage and analyze large amounts of data obtained from 
many different sources. This means writing software. The ability to create a Perl script to 
automate information management is a great advantage: whether the task is as simple as 
checking a remote web page for updates or as complex as knitting together a large number 
of third-party software packages into an analytic pipeline. 


In his first bioinformatics book, Beginning Perl for Bioinformatics, Jim introduced the 
fundamentals of programming in the language most widely used in the field. This book goes 
the next step, showing how Perl can be used to create large software projects that are 
scalable and reusable. If you are programming in Perl now and have experienced that wave of 
panic when you go back to some code you wrote six months ago and can't understand how 
the code works, then you know why you need this book. If you are an accomplished 
programmer who has heard about bioinformatics and wants to learn more, this book is also 
for you. Finally, if you are a biologist who wants to ride the crest of the information wave 
rather than being washed underneath it, then buy both this book along with Beginning Perl for 
Bioinformatics. I promise you won't be disappointed. 


Lincoln SteinCold Spring Harbor, NYSeptember 2003 
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Preface 


The history of biological research is filled with examples of new laboratory techniques which, 
at first, are suitable topics for doctoral theses but eventually become so widely useful and 
standard that they are learned by most undergraduates. The use of computer programming 
in biology research is such an increasingly standard skill for many biologists. Bioinformatics is 
one of the most rapidly growing areas of biological science. Fundamentally, it's a 
cross-disciplinary study, combining the questions of computer science and programming with 
those of biological research. 


As active sciences evolve, unifying principles and techniques developed in one field are often 
found to be useful in other areas. As a result, the established boundaries between disciplines 
are sometimes blurred, and the new principles and techniques may result in new ways of 
seeing the science as a whole. For instance, molecular biology has developed a set of 
techniques over the past 50 years that has also proved useful throughout much of biology in 
general. Similarly, the methods of bioinformatics are finding fertile ground in such fields as 
genetics, biochemistry, molecular biology, evolutionary science, development, cell studies, 
clinical research, and field biology. 


In my view, bioinformatics, which I define broadly as the use of computers in biological 
research, is becoming a foundational science for a broad range of biological studies. Just as 
it's now commonplace to find a geneticist or a field biologist using the techniques of molecular 
biology as a routine part of her research, so can you frequently find that same researcher 
applying the techniques of bioinformatics. Molecular biology and bioinformatics may not be 
the researcher's main areas of interest, but the tools from molecular biology and 
bioinformatics have become standard in searching for the answers to the questions of 
interest. The Perl programming language plays no small part in that search for answers. 
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About This Book 


This book is a continuation of my previous book, Beginning Perl for Bioinformatics (also by 
O'Reilly & Associates). As the title implies, Mastering Perl for Bioinformatics moves you to a 
more advanced level of Perl programming in bioinformatics. In this volume, I cover such 
topics as advanced data structures, object-oriented programming, modules, relational 
databases, web programming, and more advanced algorithms. The main goal of this book is 
to help you learn to write Perl programs that support your research in biology and enable you 
to adapt and use programs written by others. 


In the process of honing your programming skills, you will also learn the fundamentals of 
bioinformatics. For many readers, the material presented in these two books will be sufficient 
to support their goals in the laboratory. However, this book is not a comprehensive survey of 
bioinformatics techniques. Both Mastering Perl for Bioinformatics and Beginning Perl for 
Bioinformatics emphasize the computer programming aspects of bioinformatics. As a serious 
student, you should expect to follow this groundwork with further study in the bioinformatics 
literature. Even the Perl programming language has more complexity than can fit in this 
cross-disciplinary text. 


Readers already familiar with basic Perl and the elements of DNA and proteins can use 
Mastering Perl for Bioinformatics without reference to Beginning Perl for Bioinformatics. 
However, the two books together make a complete course suitable for undergraduates, 
graduate students, and professional biologists who need to learn programming for biology 
research. 


A companion web site at http://www.oreilly.com/catalog/mperlbio includes all the program 
code in the book. 
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What You Need to Know to Use This Book 


This book assumes that you have some experience with Perl, including a working knowledge 
of writing, saving, and running programs; basic Perl syntax; control structures such as loops 
and conditional tests; the most common operators such as addition, subtraction, and string 
concatenation; input and output from the user, files, and other programs; subroutines; the 
basic data types of scalar, array, and hash; and regular expressions for searching and for 
altering strings. In other words, you should be able to program Perl well enough to extract 
data from sources such as GenBank and the Protein Data Bank using pattern matching and 
regular expressions. 


If you are new to Perl but feel you can forge ahead using a language summary and examples 
of programs, Appendix A provides a summary of the important parts of the Perl language. 
Previous programming experience in a high-level language such as C, Java, or FORTRAN (or 
any similar language); some experience at using subroutines to break a large problem into 
smaller, appropriately interrelated parts; and a tinkerer's delight in taking things apart and 
seeing what makes them tick may be all the computer-science prerequisites you need. 


This book is primarily written for biologists, so it assumes you know the elementary facts 
about DNA, proteins, and restriction enzymes; how to represent DNA and protein data ina 
Perl program; how to search for motifs; and the structure and use of the databases 
GenBank, PDB, and Rebase. Because the book assumes you are a biologist, biology concepts 
are not explained in detail in order to concentrate on programming skills. 


Biological data appears in many forms. The most important sources of biological data include 
the repository of public genetic data called GenBank (Genetic Data Bank) and the repository 
of public protein structure data called PDB (Protein Data Bank). Many other similar sources of 
biological data such as Rebase (Restriction Enzyme Database) are in wide use. All the 
databases just mentioned are most commonly distributed as text files, which makes Perl a 
good programming tool to find and extract information from the databases. 
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Organization of This Book 


Here's a quick summary of what the book covers. If you're still relatively new to Perl you may 
want to work through the chapters in order. If you have some programming experience and 
are looking for ways to approach problems in bioinformatics with Perl, feel free to skip 
around. 


Part I 


Chapter 1 


Modules are the standard Perl way of "packaging" useful programs so that other 
programmers can easily use previous work. Such standard modules as CGI, for 
instance, put the power of interactive web site programming within reach of a 
programmer who knows basic Perl. Also discussed in later chapters are Bioperl, for 
manipulating biological data, and DBI, for gaining access to relational databases. 
Modules are sometimes considered the most important part of Perl because that's 
where a lot of the functionality of Perl has been placed. In this chapter I show how to 
write your own modules, as well as how to find useful modules and use them in your 
programs. 


Chapter 2 


Complex data structures and references are fundamentally important to Perl. The basic 
Perl data structures of scalar, array, and hash go a long way toward solving many 
(perhaps most) Perl programming problems. However, many commonly used data 
structures such as multidimensional arrays, for instance, require more sophisticated 
Perl data structures to handle them. Perl enables you to define quite complex data 
structures, and we'll see how all that works. 


String algorithms are standard techniques used in bioinformatics for finding important 
data in biological sequences; with them, you can compare two sequences, align two or 
more sequences, assemble a collection of sequence fragments, and so forth. String 
algorithms underlie many of the most commonly used programs in biology research, 
such as BLAST. In this chapter, a string matching algorithm that finds the closest 
match to a motif, based on the technique of dynamic programming, is presented in the 
form of a working Perl program. 


Chapter 3 
Object-oriented programming is a standard approach to designing programs. I 
assume, as a prerequisite, that you are familiar with the programming style called 
declarative programming. (For example, C and FORTRAN are declarative; C++ and 
Java are object-oriented; Perl can be either.) It's important for the Perl programmer 
to be familiar with the object-oriented approach. For instance, modules are usually 
defined in an object-oriented manner. 


This chapter presents, step by step, the concepts and techniques of object-oriented 
Perl programming, in the context of a module that defines a simple class for keeping 
track of genes. 


Chapter 4 


In this chapter, object-oriented programming is further explored in the context of 
developing software to convert sequence files to alternate formats (FASTA, GCG, 
etc.). The concept of class inheritance is introduced and implemented. 


Chapter 5 


This chapter further develops object-oriented programming by writing a class that 
handles Rebase restriction enzyme data, a class that calculates restriction maps, and a 
class that draws restriction maps. 
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Part IT 

Chapter 6 
Relational databases are important in programming because they save, organize, and 
retrieve data sets. This chapter introduces relational databases and the SQL language 
and includes information on designing and administering databases. I take a close look 
at how one such relational database management system, the popular MySQL, is used 
from the Perl language. 


Chapter 7 
Web programming is one of Perl's areas of strength. In this chapter, I start an example 
that puts a laboratory up on the Web using Perl and the CGI module. The software 
developed in previous chapters for restriction mapping is made accessible from the 
Web. 


Chapter 8 
Using computer graphics to display data is one of the most important programming 
skills in bioinformatics. In this chapter, graphics programs are used to dynamically 
display the output of restriction maps and data presented as graphs on the Web. The 
Perl module GD is discussed and used to generate maps on the fly from web page 
queries. 


Chapter 9 


Bioperl is a set of modules used by Perl programmers to write bioinformatics 
applications. In this chapter you'll see an introduction of the Bioperl project. Bioperl is 
open source (free under a very nonrestrictive copyright) and developed by a group of 
volunteers, many based in supportive research organizations. In recent years it has 
achieved critical mass and is now adequately documented and fairly broad in scope. If 
you do Perl bioinformatics programming, you should certainly be aware of what Bioperl 
has to offer, to avoid reinventing the wheel. 


Part ITT 
Appendix A 
This appendix summarizes the parts of Perl we've covered. 


Appendix B 
This appendix outlines how to install Perl. 
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Conventions Used in This Book 


The following conventions are used in this book: 

Constant width 
Used for arrays, classes, code examples, loops, modules, namespaces, objects, 
packages, statements, and to show the output of commands. 

Italics 


Used for commands, directory names, filenames, example URLs, variables, and for 
new terms where they are defined. 


This icon designates a note, which is an important aside to the nearby 
text. 





| This icon designates a warning relating to the nearby text. 


4 PREVIOUS | NEXT > 
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Comments and Questions 


Please address comments and questions concerning this book to the publisher: 


O'Reilly & Associates, Inc. 

1005 Gravenstein Highway North 

Sebastopol, CA 95472 

(800) 998-9938 (in the United States or Canada) 
(707) 829-0515 (international or local) 

(707) 829-0104 (fax) 


There is a web page for this book, which lists errata, examples, or any additional information. 
You can access this page at: 


http://www.oreilly.com/catalog/mperlbio 





To comment or ask technical questions about this book, send email to: 
bookquestions@oreilly.com 





For more information about books, conferences, Resource Centers, and the O'Reilly Network, 
see the O'Reilly web site at: 





http://www.oreilly.com 
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Chapter 1. Modular Programming with 
Perl 


Perl modules are essential to any Perl programmer. They are a great way to organize code 
into logical collections of interacting parts. They collect useful Perl subroutines and provide 
them to other programs (and programmers) in an organized and convenient fashion. 


This chapter begins with a discussion of the reasons for organizing Perl code into modules. 
Modules are comparable to subroutines: both organize Perl code in convenient, reusable 
"chunks." 


Later in this chapter, I'll introduce a small module, GeneticCode.pm. This example shows how 
to create simple modules, and I'll give examples of programs that use this module. 


I'll also demonstrate how to find, install, and use modules taken from the all-important CPAN 
collection. A familiarity with searching and using CPAN is an essential skill for Perl 
programmers; it will help you avoid lots of unnecessary work. With CPAN, you can easily find 
and use code written by excellent programmers and road-tested by the Perl community. 
Using proven code and writing less of your own, you'll save time, money, and headaches. 
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1.1 What Ils a Module? 


A Perl module is a library file that uses package declarations to create its own namespace. 
Perl modules provide an extra level of protection from name collisions beyond that provided 
by my and use strict. They also serve as the basic mechanism for defining object-oriented 
classes. 
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1.2 Why Perl Modules? 


Building a medium- to large-sized program usually requires you to divide tasks into several 
smaller, more manageable, and more interactive pieces. (A rule of thumb is that each "piece" 
should be about one or two printed pages in length, but this is just a general guideline.) An 
analogy can be made to building a microarray machine, which requires that you construct 
separate interacting pieces such as housing, temperature sensors and controls, robot arms to 
position the pipettes, hydraulic injection devices, and computer guidance for all these 
systems. 


1.2.1 Subroutines and Software Engineering 


Subroutines divide a large programming job into more manageable pieces. Modern 
programming languages all provide subroutines, which are also called functions, coroutines, or 
macros in other programming languages. 


A subroutine lets you write a piece of code that performs some part of a desired computation 
(e.g., determining the length of DNA sequence). This code is written once and then can be 
called frequently throughout the main program. Using subroutines speeds the time it takes to 
write the main program, makes it more reliable by avoiding duplicated sections (which can get 
out of sync and make the program longer), and makes the entire program easier to test. A 
useful subroutine can be used by other programs as well, saving you development time in the 
future. As long as the inputs and outputs to the subroutine remain the same, its internal 
workings can be altered and improved without worrying about how the changes will affect the 
rest of the program. This is known as encapsulation. 


The benefits of subroutines that I've just outlined also apply to other approaches in software 
engineering. Perl modules are a technique within a larger umbrella of techniques known as s 
oftware encapsulation and reuse. Software encapsulation and reuse are fundamental to 
object-oriented programming. 


A related design principle is abstraction, which involves writing code that is usable in many 
different situations. Let's say you write a subroutine that adds the fragment TTTTT to the end 
of a string of DNA. If you then want to add the fragment AAAAA to the end of a string of DNA, 
you have to write another subroutine. To avoid writing two subroutines, you can write one 
that's more abstract and adds to the end of a string of DNA whatever fragment you give it as 
an argument. Using the principle of abstraction, you've saved yourself half the work. 


Here is an example of a Perl subroutine that takes two strings of DNA as inputs and returns 
the second one appended to the end of the first: 
sub DNAappend { 
my (Sdna, S$tail) = @ ; 
return(Sdna . Stail); 
} 


This subroutine can be used as follows: 
my Sdna = 'ACCGGAGTTGACTCTCCGAATA'; 
my SpolyT = ‘'TTTTTTTT'; 
print DNAappend(Sdna, SpolyT); 
If you wish, you can also define subroutines polyT and polyA like so: 
sub polyT { 
my ($dna) = @ ; 


return DNAappend($dna, 'TTTTTTTT'); 
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sub polyA { 
my ($dna) = @ ; 


return DNAappend($dna, 'AAAAAAAA') ; 
} 


At this point, you should think about how to divide a problem into interacting parts; that is, an 
optimal (or at least good) way to define a set of subroutines that can cooperate to solve a 
particular problem. 


1.2.2 Modules and Libraries 


In my projects, I gather subroutine definitions into separate files called /ibraries,“’ or modules, 
which let me collect subroutine definitions for use in other programs. Then, instead of copying 
the subroutine definitions into the new program (and introducing the potential for inaccurate 
copies or for alternate versions proliferating), I can just insert the name of the library or 
module into a program, and all the subroutines are available in their original unaltered form. 
This is an example of software reuse in action. 


"] perl libraries were traditionally put in files ending with .p/, which stands for perl /ibrary; the term /ibrary is 


also used to refer to a collection of Perl modules. The common denominator is that a library is a collection of 
reusable subroutines. 


To fully understand and use modules, you need to understand the simple concepts of 
namespaces and packages. From here on, think of a Perl module as any Perl library file that 
uses package declarations to create its own namespace. These simple concepts are 
examined in the next sections. 


Team LiB 


Page 19 


[Team LiB J 


1.3 Namespaces 


A namespace is implemented as a table containing the names of the variables and subroutines 
in a program. The table itself is called a symbo/ table and is used by the running program to 
keep track of variable values and subroutine definitions as the program evolves. A namespace 
and a symbol table are essentially the same thing. A namespace exists under the hood for 
many programs, especially those in which only one default namespace is used. 


Large programs often accidentally use the same variable name for different variables in 
different parts of the program. These identically named variables may unintentionally interact 
with each other and cause serious, hard-to-find errors. This situation is called namespace 
collision. Separate namespaces are one way to avoid namespace collision. 


The package declaration described in the next section is one way to assign separate 
namespaces to different parts of your code. It gives strong protection against accidentally 
using a variable name that's used in another part of the program and having the two 
identically-named variables interact in unwanted ways. 


1.3.1 Namespaces Compared with Scoping: my and use strict 


The unintentional interaction between variables with the same name is enough of a problem 
that Perl provides more than one way to avoid it. You are probably already familiar with the 
use of my to restrict the scope of a variable to its enclosing block (between matching curly 
braces {}) and should be accustomed to using the directive use strict to require the use of 
my for all variables. use strict and my are a great way to protect your program from 


unintentional reuse of variable names. Make a habit of using my and working under use 
StEUCC. 
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1.4 Packages 


Packages are a different way to protect a program's variables from interacting unintentionally. 
In Perl, you can easily assign separate namespaces to entire sections of your code, which 
helps prevent namespace collisions and lets you create modules. 


Packages are very easy to use. A one-line package declaration puts a new namespace in 
effect. Here's a simple example: 

S$dna = 'AAAAAAAAAA'; 

package Mouse; 

$dna = 'CCCCCCCCCC'; 

package Celegans; 

S$dna = 'GGGGGGGGGG'; 


In this snippet, there are three variables, each with the same name, $dna. However, they are 
in three different packages, so they appear in three different symbol tables and are managed 
separately by the running Perl program. 


The first line of the code is an assignment of a poly-A DNA fragment to a variable $dna. 
Because no package is explicitly named, this $dna variable appears in the default namespace 
main. 


The second line of code introduces a new namespace for variable and subroutine definitions 
by declaring package Mouse;. At this point, the main namespace is no longer active, and the 
Mouse namespace is brought into play. Note that the name of the namespace is capitalized; 
it's a well-established convention you should follow. The only noncapitalized namespace you 
should use is the default main. 


Now that the Mouse namespace is in effect, the third line of code, which declares a variable, 
$dna, is actually declaring a separate variable unrelated to the first. It contains a poly-C 
fragment of DNA. 


Finally, the last two lines of code declare a new package called Celegans and a new variable, 
also called $dna, that stores a poly-G DNA fragment. 


To use these three Sdna variables, you need to explicitly state which packages you want the 
variables from, as the following code fragment demonstrates: 




















print "The DNA from the main package:\n\n"; 
print Smain::dna, "\n\n"; 

print "The DNA from the Mouse package:\n\n"; 
print S$Mouse::dna, "\n\n"; 

print "The DNA from the Celegans package:\n\n"; 
print $Celegans::dna, "\n\n"; 

This gives the following output: 

The DNA from the main package: 


AAAAAAAAAA 

The DNA from the Mouse package: 
CGCCOCECECE 

The DNA from the Celegans package: 


GGGGGGGGGG 
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As you can see, the variable name can be specified as to a particular package by putting the 
package name and two colons before the variable name (but after the $, @, or % that specifies 
the type of variable). If you don't specify a package in this way, Perl assumes you want the 
current package, which may not necessarily be the main package, as the following example 
shows: 


# 

# Define the variables in the packages 
# 

Sdna = 'AAAAAAAAAA'; 


package Mouse; 


Sdna = 'CCCCCCCCCC'; 

i 

# Print the values of the variables 

i 

print "The DNA from the current package:\n\n"; 
print Sdna, “\n\n"; 

print "The DNA from the Mouse package:\n\n"; 
print S$Mouse::dna, "\n\n"; 


This produces the following output: 


The DNA from the current package: 
CVCCECECCE 





The DNA from the Mouse package: 
CUCCECCECG 


Both print $dna and print $Mouse::dna reference the same variable. This is because the 
last package declaration was package Mouse;, So the print $dna statement prints the value 
of the variable $dna as defined in the current package, which is Mouse. 


The rule is, once a package has been declared, it becomes the current package until the next 
package declaration or until the end of the file. (You can also declare packages within blocks, 
evals, or subroutine definitions, in which case the package stays in effect until the end of the 
block, eval, or subroutine definition. ) 


By far the most common use of package is to call it once near the top of a file and have it 
stay in effect for all the code in the file. This is how modules are defined, as the next section 
shows. 

[ Team LiB ] 
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1.5 Defining Modules 


To begin, take a file of subroutine definitions and call it something like Newmodule.pm, Now, 
edit the file and give it a new first line: 


package Newmodule; 

and a new last line 1;. You've now created a Perl module. 

To make a Celegans module, place subroutines in a file called Celegans.pm, and add a first 
line: 

package Celegans; 


Add a last line 1;, and you've defined a Celegans module. This last line just ensures that the 
library returns a true value when it's read in. It's annoying, but necessary. 


Team LiB 
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1.6 Storing Modules 


Where you store your .pm module files on your computer affects the name of the module, so 
let's take a moment to sort out the most important points. For all the details, consult the 
perlmod and the perlmodlib parts of the Perl documentation at http://www.perldoc.org. 
You can also type perldoc perlmod or perldoc perlmodlib ata shell prompt or ina 
command window. 





Once you start using multiple files for your program code, which happens if you're defining 
and using modules, Perl needs to be able to find these various files; it provides a few different 
ways to do so. 


The simplest method is to put all your program files, including your modules, in the same 
directory and run your programs from that directory. Here's how the module file 
Celegans.pm is loaded from another program: 


use Celegans; 


However, it's often not so simple. Perl uses modules extensively; many are built-in when you 
install Perl, and many more are available from CPAN, as you'll see later. Some modules are 
used frequently, some rarely; many modules call other modules, which in turn call still other 
modules. 


To organize the many modules a Perl program might need, you should place them in certain 
standard directories or in your own development directories. Perl needs to know where these 
directories are so that when a module is called in a program, it can search the directories, find 
the file that contains the module, and load it in. 


When Perl was installed on your computer, a list of directories in which to find modules was 
configured. Every time a Perl program on your computer refers to a module, Perl looks in 
those directories. To see those directories, you only need to run a Perl program and examine 
the built-in array @INC, like so: 


pring join va", GING)... “Vay 
On my Linux computer, I get the following output from that statement: 


usr/local/lib/per15/5.8.0/i686-linux 
usr/local/lib/perl5/5.8.0 
usr/local/lib/perl5/site perl/5.8.0/i686-linux 
usr/local/lib/perl5/site perl/5.8.0 
usr/local/lib/perl5/site perl/5.6.1 
usx/lLocal/lab/perls/site: perl / 5.620 
usr/local/lib/perl5/site perl 








Pegs CPs Pres SR SS SR, 








These are all locations in which the standard Perl modules live on my Linux computer. @INC is 
simply an array whose entries are directories on your computer. The way it looks depends on 
how your computer is configured and your operating system (for instance, Unix computers 
handle directories a bit differently than Windows). 


Note that the last line of that list of directories is a solitary period. This is shorthand for "the 
current directory," that is, whatever directory you happen to be in when you run your Perl 
program. If this directory is on the list, and you run your program from that directory as well, 
Perl will find the .pm files. 


When you develop Perl software that uses modules, you should put all the modules together 
in a certain directory. In order for Perl to find this directory, and load the modules, you need 
to add a line before the use MODULE directives, telling Perl to additionally search your own 
module directory for any modules requested in your program. For instance, if I put a module 
I'm developing for my program into a file named Celegans.pm, and put the Celegans.pm file 
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into my Linux directory /home/tisdall/MasteringPerlBio/development/lib, I need to add a use 
lib directive to my program, like so: 


use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 


use Celegans; 


Perl then adds my development module directory to the @INC array and searches there for 
the Celegans.pm module file. The following code demonstrates this: 


use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 


print join(’\n", GING), wt" 
This produces the output: 


/home/tisdall/MasteringPerlBio/development/lib 
/usr/local/lib/perl5/5.8.0/i1686-linux 























/usr/local/lib/perl15/5.8.0 
/usr/local/lib/perl5/site perl/5.8.0/i1686-linux 
/usr/local/lib/perl5/site perl/5.8.0 
/usr/local/lib/perl5/site perl/5.6.1 
/usr/local/lib/perl5/site perl/5.6.0 

j 








usr/local/lib/perl5/site perl 





Thanks to the use lib directive, Perl can now find the Celegans. pm file in the @INC list of 
directories. 


A problem with this approach to finding libraries is that the directory pathnames are 
hardcoded into each program. If you then want to move your own library directory 
somewhere else or move the programs to another computer where different pathnames are 
used, you need to change the pathnames in all the program files where they occur. 


If, for instance, you download several programs from this book's web site, and you don't 
want to edit each one to change pathnames, you can use the PERL5LIB environmental 
variable. To do so, put all the modules under the directory /my/perl/modules (for example). 
Now set the PERL5LIB variable: 


PERL5LIB=SPERL5LIB: /my/perl/modules 











You can also set it this way: 
setenv PERL5LIB /my/perl/modules 











If you have "taint" security checks enabled in your version of Perl, you still have to hardcode 
the pathname into the program. This, of course, behaves differently on different operating 
systems. 


You can also specify an additional directory on the command line: 

perl -I/my/perl/modules myprogram.pl 

There's one other detail about modules that's important. You'll sometimes see modules in 
Perl programs with names such as Genomes: :Modelorganisms: :Celegans, in which the 
name is two or more words separated by two colons. This is how Perl looks into 
subdirectories of directories named in the @INc built-in array. In the example, Perl looks for a 
subdirectory named Genomes in one of the @INCc directories; then for a subdirectory named 
Modelorganisms within the Genomes subdirectory; finally, for a file named Celegans.pm within 
the Modelorganisms subdirectory. That is, my module is in the file: 


/home/tisdall/MasteringPerlBio/development/lib/Genomes/Modelorganisms/Celegan 
s.pm 


and it's called in my Perl program like so: 
use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 


use Genomes::Modelorganisms: :Celegans; 
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There are more details you can learn about storing and finding modules on your computer, 
but these are the most useful facts. See the perlmod, perlrun, and perlmodlib sections of 
the Perl manual for more details if and when you need them. 


i 
Tea 

m 

LiB 
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1.7 Writing Your First Perl Module 


Now that you've been introduced to the basic ideas of modules, it's time to actually examine 
a working example of a module. 


In this section, we'll write a module called Geneticcode.pm, which implements the genetic 
code that maps DNA codons to amino acids and then translates a string of DNA sequence 
data to a protein fragment. 


1.7.1 An Example: Geneticcode.pm 


Let's start by creating a file called Geneticcode.pm and using it to define the mapping of 
codons to amino acids in a hash variable called genetic_code. We'll also discuss a 
subroutine called codon2aa that uses the hash to translate its codon arguments into amino 
acid return values. 


Here are the contents of the first module file Geneticcode.pm: 


package Geneticcode; 


use strict; 
use warnings; 


my (genetic code) = ( 























"TCA! => NS", # Serine 
"TCC! => TS", Serine 
"TCG' => '"S"', Serine 
“TCE SS 'S", Serine 
'TTEe’ => “EF', Phenylalanine 
SEP Se NE Phenylalanine 
ETA. => "hy", Leucine 
'TTG' => 'L', Leucine 
"TACT 2 Sy, Tyrosine 
"TAT? Sy, Tyrosine 
"TAA' => ' ', Stop 

'"TAG' => ar Stop 

“EGE” => "Ee, Cysteine 
TG => TOV, # Cysteine 
"GAT => '_", # Stop 

'TGG' => 'W', # Tryptophan 
CTA. => 7h", # Leucine 
Cre’. => ti", Leucine 
'CTG' => 'L', Leucine 
COLT => th", Leucine 
CCA. => TRY, Proline 
"CCC' => 'P!, Proline 
"CCG" => 'P!, Proline 
NCCT =e PY, Proline 
'CAC' => 'H', Histidine 
“CATT => TH", Histidine 
HCAAY => TO", Glutamine 
'CAG' => 'Q', Glutamine 
'CGA' => 'R', Arginine 
"CGC* => TR", Arginine 
"CGG' => 'R', Arginine 
"CGE" => "R', Arginine 
TATA! => TIL", Isoleucine 
TALC) SS UT" # Lsoleucine 
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TATE => 8" Tsoleucine 
'ATG' => 'M', Methionine 
"ACA' => 'T', Threonine 
"ACC! => 'T', Threonine 
"ACG' => 'T', Threonine 
TACT Y -S> VE, Threonine 
"AAC' => 'N', # Asparagine 
"AAT' => 'N', # Asparagine 
"AAA' => 'K', # Lysine 
"AAG' => 'K', Lysine 
"AGC' => 'S', Serine 
'AGT' => 'S', Serine 
"AGA' => 'R'!, Arginine 
"AGG' => 'R!, Arginine 
'GTA' => 'v', Valine 
'GTC' => 'v', Valine 
'"GTG' => 'v', Valine 
'GTT' => 'v', Valine 
'GCA' => 'A', Alanine 
"GCC' => 'A', Alanine 
"GCG' => 'A', Alanine 
'GCT' => 'A', Alanine 
"GAC' => 'D', Aspartic Acid 
"GAT' => 'D', Aspartic Acid 
"GAA! => "ER", Glutamic Acid 
"GAG' => 'E', Glutamic Acid 
'GGA' => 'G', Glycine 
"GGC' => 'G', Glycine 
"GGG' => 'G', Glycine 
'GGT' => 'G', Glycine 

)e 

# 

# codon2aa 

# 


# A subroutine to translate a DNA 3-character codon to an amino acid 
# Version 3, using hash lookup 


sub codon2aa { 
my($codon) = @ ; 


Scodon = uc $codon; 
if(exists Sgenetic_code{$codon}) { 


return S$genetic_code{$codon}; 
}else{ 





die "Bad codon 'Scodon'!!\n"; 


17 


Now, let's examine the code. First, the module declares its package with a name ( 
Geneticcode) that is the same as the file it is in (Geneticcode.pm), but minus the file 
extension .pm, 


The directives: 


use strict; 
use warnings; 


will appear in all the code. The use strict directive enforces the use of the my directive for all 
variables. The use warnings directive produces useful messages about potential problems in 
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your code. (It is possible to turn both directives off when requiredto avoid annoying warnings 
in your program output, for instance. See the perldiag, perllexwarn, and perlmodlib 
sections of the Perl manual.) 


Finally, there is a subroutine definition for codon2aa. As an argument, this subroutine takes a 
codon represented as a string of three DNA bases and returns the amino acid code 
corresponding to the codon. It accomplishes this by a simple lookup in the hash 

genetic code and returns the result from the subroutine using the return built-in function. 


The codon2aa Subroutine calls die and exits the program when it encounters an undefined 
codon. See the exercises at the end of this chapter for a discussion of the pros and cons of 
this behavior. 


In my earlier book, I defined the hash sgenetic_code within the subroutine codon2aa. That 
meant that every time the subroutine was called, the hash would have to be initialized, which 
took a bit of time. In this version, the hash only has to be initialized once, when the program 
is first called, which results in a significant speedup. The definition of the hash is outside of the 
subroutine definition, but in the namespace of the Geneticcode package. The hash is 
initialized when the Geneticcode.pm module is loaded by this statement: 


use Geneticcode; 
Every subsequent call to the codon2aa subroutine simply accesses the hash without having to 
initialize it each time. 


Here's an example that uses the new Geneticcode module, which is saved in a file called 
testGeneticcode and run by typing perl testGeneticcode: 





use strict; 
use warnings; 


use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 





use Geneticcode; 
my Sdna = 'AACCTTCCTTCCGGAAGAGAG'!; 


# Initialize variables 
my Sprotein = ''; 


# Translate each three-base codon to an amino acid, and append to a protein 


for(my $i=0; Si < (length($dna) - 2) ; Si += 3) { 

Sprotein .= Geneticcode::codon2aa( substr($dna,$i,3) ); 
} 
print Sprotein, "\n"; 


Recall that the Perl built-in function substr can extract a portion of a string. In this case, 
substr extracts from $dna the three characters beginning at the position given in the counter 
variable $i; this three-character codon is then passed as the argument to the subroutine 
codon2aa. This program produces the output: 


NLPSGRE 


1.7.2 Expanding Geneticcode.pm 





Now, let's expand our Geneticcode module example. This new version of the module 
includes a few short helper subroutines. The interest here lies in how the subroutines interact 
with each other in the module's namespace, and how to access the code within the module 
from a Perl program that uses the module. 


Modules are a great way to organize code into logical collections of interacting parts. When 
you create modules, you need to decide how to organize your code into the appropriate 
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collection of modules. Here, we have some subroutines that translate codons into amino 
acids; others read sequence data from files and print it to the screen. This is a fairly obvious 
division of functionality, so let's create two modules for this code. We'll expand the 
Geneticcode module; let's also create a SequenceIO module. Of course, the new module will 
be created in a file called Sequence1IO.pm, and that file will be placed in a directory that Perl 
can findin this case, the same directory in which we've placed the Geneticcode module. 


Here's the code for Geneticcode.pm: 


package Geneticcode; 


use strict; 
use warnings; 


my (%sgenetic_ code) = ( 





"CAT > MS" 5 # Serine 
VECC -S> MSs # Serine 
"TCG => "Ss", # Serine 
"ECT! -S> Me. # Serine 
TETC!. => MEY # Phenylalanine 
as before 
'GAG' => 'E', # Glutamic Acid 
"GGAY => 1G", # Glycine 
'GGC! => 'G"', # Glycine 
'GGG! => 'G"', # Glycine 
'GERT => “hE > # Glycine 


# codon2aa 

# 

# A subroutine to translate a DNA 3-character codon to an amino acid 
# Version 3, using hash lookup 


sub codon2aa { 
my($codon) = @ ; 


Scodon = uc $codon; 


if(exists Sgenetic_code{$codon}) { 
return $genetic_code{$codon}; 


selse{ 
die "Bad codon 'Scodon'!!\n"; 
} 
} 
a 
# dna2peptide 
te 


# A subroutine to translate DNA sequence into a peptide 
sub dna2peptide { 
my($dna) = @ ; 


# Initialize variables 
my Sprotein = ''; 
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# Translate each three-base codon to an amino acid, and append to a 
protein 
for(my S$i=0; $i < (length(Sdna) - 2) ; Si += 3) { 
Sprotein .= codon2aa( substr($dna,$i,3) ); 


} 


return Sprotein; 


} 


# translate frame 
te 


# A subroutine to translate a frame of DNA 
sub translate frame { 


my(S$seq, $start, $end) = @ ; 
my Sprotein; 


# To make the subroutine easier to use, you won't need to specify 

# the end point-it will just go to the end of the sequence 

# by default. 

unless(Send) { 
Send = length($seq) ; 

} 





# Finally, calculate and return the translation 
return dna2peptide ( substr ( S$seq, $start - 1, Send -Sstart + 1) ); 
} 


1; 


Now, we have in one module the code that accomplishes a translation from the genetic code. 
However, we also need to read sequence in from FASTA sequence files, and print out 
sequence (the translated protein) to the screen. Because these needs are likely to recur in 
many programs, it makes sense to make a separate module for just the file reading, 
sequence extraction, and sequence printing operations. (This may even be too much in one 
module; maybe there should be separate modules for each need? See the exercises at the 
end of the chapter.) 


Here's the code for the second module SequenceIO.pm, which handles reading from a file, 
extracting FASTA sequence data, and printing sequence data: 


package Sequencel0O; 


use strict; 
use warnings; 


# get file data 
i# 


# A subroutine to get data from a file given its filename 
sub get_file data { 
my(Sfilename) = @ ; 


# Initialize variables 
my @filedata = ( ); 














open(GET FILE DATA, $filename) or die "Cannot open file 
'Sfilename':$!\n\n"; 








@filedata = <GET_FILE DATA>; 
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close GET FILE DATA; 











return @filedata; 


} 


# extract _sequence from fasta data 
# 


# A subroutine to extract FASTA sequence data from an array 





sub extract sequence from _fasta_data { 





my(@fasta_file data) = @ ; 


# Declare and initialize variables 
my Ssequence = ''; 


foreach my $line (@fasta_file data) { 


# Giscard blank line 
if (Sline =~ /*\s*S/) { 
next; 


# discard comment line 
} elsif ($line =~ /*\s*#/) { 
next; 


# discard fasta header line 
} elsif (Sline =~ /*>/). { 
next; 


keep line, add to sequence string 
} else { 
Ssequence .= $line; 





} 


# remove non-sequence data (in this case, whitespace) from Ssequence 
string 
Ssequence =~ s/\s//g; 


return Ssequence; 


} 
# print sequence 
: A subroutine to format and print sequence data 
sub print sequence { 
my(S$sequence, S$length) = @ ; 
# Print sequence in lines of Slength 


for (my Spos = 0 ; Spos < length(Ssequence) ; Spos += Slength ) { 
print substr(Ssequence, Spos, Slength), "\n"; 


1; 
Before we discuss the code, let's see a small program that uses it: 


# Translate a DNA sequence into one of the six reading frames 
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SEELCE 
use warnin 
use 
use 
use 


Geneti 
Sequen 





y @file d 
y Sdna = 

y Srevcom 
y Sprotei 


ao & Ss 


Read in 
@file data 





Extract 


, 


Qs; 


ccode; 
celO; 


ata = ( )? 


Ua ee 
rg 
= PBs 
sf 
n= 't-. 
, 


Initialize variables 


the contents of the file "sample.dna" 


= Sequ 





lib "/home/tisdall/MasteringPerlBio/development/1lib"; 


ncelO::get_ file data("sample.dna"); 


the sequence data from the contents of the file "sample.dna" 





Sdna = Seq 











print “\n 


Sprotein = 


SequencelO: :print_sequence ($protein, 


exit; 


uencelO:: 





Geneticcode::translate frame ($dna, 


Here's the input file: 


> sample dna 





TO) 


1) 


(This is a typical fasta header.) 


agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg 
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct 
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc 
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgcetggcgggggt 
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc 


cagagcctccagat 


tgccggggaggacagcaagtccgagaatggggagaat 


gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat 


cgggtgtgacaact 
ctgagaagatggccaaggccat 
gagaaagaccccaagctagaga 





ggcgcaagaggcctgtccctg 
gggacaggggttggggccatg 
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc 
agcagcagcagatcaaacggtcagcccgcatgtgtggtgag 
tgtcggcgcactgaggactgtggtcactg 





gaagttcgggggccccaacaaga 
gccagctgcgggcccgggaatcg 





tgcaatgag 


at 
ok 








tggttccatggggactgcatccggatca 
ccgggagtggtactgtcgggagtgcaga 
ttcgctatcggcacaagaagtcacggga 
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag 
tccagacctgcagcgccgggcagggtca 


tgctcggggctctgcttcgccccacaa 


tgtgaggca 


tgatttctgtcgggacatgaa 


tccggcagaagtgccggctgcgccagt 
Lacaagqtacticceticcteqcticica 


ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac 


ccaacagcagccacagccatcacagaagttagggcgca 
agggggcagtggcgtcatcaacagtcaaggagcctcc 








acacctgagccactctcagatgaggaccta 


Here's the output of the program: 


RWRR_GV 


iGA 


1GRPPTGLOR 





CAGSGDMEG 


tccgtgaagatg 
tgaggctacagcc 





RRRMGPAQ EYAAWEA LEA 





EVVVGAFATAW 








DGSDPEPPDAG 


E DSKSENG 





ENAPIYCICRKP 














ITREWYCREC 








REKDPKLEIRY 





GSAS PHKSS 
LROCQLRAR 





POPLVATPSOH 
ESYKYFPSSLS 











RHKKSRER 





DGNERDSSEPRD 


EGGGRKRPVPD 





HOOQOOOT 
PVTPSESL 

















KRSARMCGECEACRRT 











PRPRRPLPTQOQO 





POPSQKLGRIR 


xtract_ sequence from fasta _data(@file data); 


Translate the DNA to protein in one of the six reading frames 
and print the protein in lines 70 characters long 


DAAEWSVOVRGS 


x 


DINCFMIGCDNCNEWFHGDCIRIT 
PDLORRAGSGTGVGAMLAR 
EDCGHCDFCRDMKKFGGPN 














E DEGAVASSTVK 





AAGVVRE 








EKMAKA 


KIRQKCR 








G 





PP 





BATA 
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TPEPLSDEDL 











A few comments are in order. First, the subroutines for translating codons are in the 
Geneticcode module. They include the hash sgenetic_code and the subroutines codon2aa, 
dna2peptide, and translate frame, which are involved with translating DNA data to 
peptides. The subroutines for reading sequence data in from files, and for formatting and 
printing it to the screen, are in the SequenceIO module. They are the subroutines 

get_file data, extract_sequence from _fasta_data, and print sequence. 


Now, we have two modules and code that exercises them; let's look at some more facets of 
using modules. 


[ 
Tea 
m 
LiB ] 
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1.8 Using Modules 


So far, the benefit of modules may seem questionable. You may be wondering what the 
advantage is over simple libraries (without package declarations), since the main result seems 
to be the necessity to refer to subroutines in the modules with longer names! 


1.8.1 Exporting Names 


There's a way to avoid lengthy module names and still use the short ones if you place a call 
to the special Exporter module in the module code and modify the use MODULE declaration in 
the calling code. 








Going back to the first example Geneticcode.pm module, recall it began with this line: 


package Geneticcode; 


and included the definition for the hash genetic_code and the subroutine codon2aa. 


If you add these lines to the beginning of the file, you can export the symbol names of 
variables or subroutines in the module into the namespace of the calling program. You can 
then use the convenient short names for things (€.g., codon2aa instead of 
Geneticcode::codon2aa). Here's a short example of how it works (try typing perldoc 
Exporter to see the whole story): 





package Geneticcode; 





require Exporter; 
@ISA = qw(Exporter); 





@EXPORT OK = qw(...); # symbols to export on request 


Here's how to export the name codon2aa from the module only when explicitly requested: 








@EXPORT_OK = qw(codon2aa) ; # symbols to export on request 


The calling program then has to explicitly request the codon2aa symbol like so: 





use Geneticcode qw(codon2aa) ; 
If you use this approach, the calling program can just say: 
codon2aa(S$codon) ; 


instead of: 


Geneticcode::codon2aa(Scodon) ; 





The Exporter module that's included in the standard Perl distribution has several other 
optional behaviors, but the way just shown is the safest and most useful. As you'll see, the 
object-oriented programming style of using modules doesn't use the Export facility, but it is a 
useful thing to have in your bag of tricks. For more information about exporting (such as why 
exporting is also known as "polluting your namespace"), see the Perl documentation for the 
Exporter module (by typing perldoc Exporter at a command line or by going to the 
http://www.perldoc.com web page). 
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1.9 CPAN Modules 


The Comprehensive Perl Archive Network (CPAN, http://www.cpan.org) is an impressively 
large collection of Perl code (mostly Perl modules). CPAN is easily accessible and searchable 
on the Web, and you can use its modules for a variety of programming tasks. 





By now you should have the basic idea of how modules are defined and used, so let's take 
some time to explore CPAN to see what goodies are available. 


There are two important points about CPAN. First, a large number of the things you might 
want your programs to do have already been programmed and are easily obtained in 
downloadable modules. You just have to go find them at CPAN, install them on your 
computer, and call them from your program. We'll take a look at an example of exactly that 
in this section. 


Second, all code on CPAN is free of charge and available for use by a very unrestrictive 
copyright declaration. Sound good? Keep reading. 


CPAN includes convenient ways to search for useful modules, and there's a CPAN.pm module 
built-in with Perl that makes downloading and installing modules quite easy (when things work 
well, which they usually do). If you can't find CPAN.pm, you should consider updating your 
current version. 


You can find more information by typing the following at the command line: 
perldoc CPAN 


You can also check the Frequently Asked Questions (FAQ) available at the CPAN web site. 
1.9.1 What's Available at CPAN? 


The CPAN web site offers several "views" of the CPAN collection of modules and several 
alternate ways of searching (by module name, category, full text search of the module 
documentation, etc.). Here is the top-level organization of the modules by overall category: 


Development Support 
Operating System Interfaces 
Networking Devices IPC 

Data Type Utilities 
Database Interfaces 

User Interfaces 

Language Interfaces 

File Names Systems Locking 
String Lang Text Proc 

Opt Arg Param Proc 
Internationalization Locale 
Security and Encryption 
World Wide Web HTML HTTP CGI 
Server and Daemon Utilities 
Archiving and Compression 
Images Pixmaps Bitmaps 

Mail and Usenet News 
Control Flow Utilities 

File Handle Input Output 
Microsoft Windows Modules 
Miscellaneous Modules 
Commercial Software Interfaces 
Not In Modulelist 


1.9.2 Searching CPAN 
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CPAN's main web page has a few ways to search the contents. Let's say you need to 
perform some statistics and are looking for code that's already available. We'll go through the 
steps necessary to search for the code, download and install it, and use the module in a 
program. 


At the main CPAN page, look for "searching" and click on search.cpan.org. If you search for 
"statistics" in all locations, you'll get over 300 hits, so you should restrict your search to 
modules with the pull-down menu. You'll get 25 hits (more by the time you read this); here's 
what you'll see: 


1. Statistics::Candidates 
Statistics-MaxEntropy-0.9 - 26 Nov 1998 - Hugo WL ter Doest 





2. Statistics: ChaSquare 
How random is your data? 
Statistics-ChiSquare-0.3 - 23 Nov 2001 - Jon Orwant 


3. Statistics: : Contingency 
Calculate precision, recall, Fl, accuracy, etc. 
Statistics-Contingency-0.03 - 09 Aug 2002 - Ken Williams 


4. Statistics::DEA 
Discontiguous Exponential Averaging 
Statistics-DEA-0.04 - 17 Aug 2002 - Jarkko Hietaniemi 











5. Statistics::Descriptive 
Module of basic descriptive statistical functions. 
Statistics-Descriptive-2.4 - 26 Apr 1999 - Colin Kuskie 


6. SLALLStices: + Distr buttons 

Perl module for calculating critical values of common statistical 
distributions 

Statistics-Distributions-—0.07 - 22 Jun 2001 - Michael Kospach 











7. Statistics::Frequency 
simple counting of elements 
Statistics-Frequency-0.02 - 24 Apr 2002 - Jarkko Hietaniemi 


8. Statistics::GaussHelmert 
General weighted least squares estimation 
Statistics-GaussHelmert-—0.05 - 18 Apr 2002 - Stephan Heuel 





9... Statistics. LLU 
An implementation of Linear Threshold Units 
Statistics-LTU-2.8 - 27 Feb 1997 - Tom Fawcett 





10. Statistics::Lite 
Small stats stuff. 
Statistics-Lite-1.02 - 15 Apr 2002 - Brian Lalonde 











11. Statistics: :MaxEntropy 
Statistics-MaxEntropy-0.9 - 26 Nov 1998 - Hugo WL ter Doest 





12. Statistics: OLS 
perform ordinary least squares and associated statistics, v 0.07. 
Statistics=OLS-0.07 = 13 Oct 2000 = Sanford Morton 


13. Statisties:-: ROC 
receiver-operator-characteristic (ROC) curves with nonparametric confidence 
bounds 


Statistics-ROC-0.01 - 22 Jul 1998 - Hans A. Kestler 


14. Statistics::Regression 
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weighted linear regression package (linetplane fitting) 
StatisticsRegression - 26 May 2001 - ivo welch 











15. Statistics: :SparseVector 
Perl5 extension for manipulating sparse bitvectors 
Statistics-MaxEntropy-0.9 - 26 Nov 1998 - Hugo WL ter Doest 





16. Statistics: :Descriptive: :Discrete 
Compute descriptive statistics for discrete data sets. 
Statistics-Descriptive-Discrete-0.07 - 13 Jun 2002 - Rhet Turnbull 


17. Bio::Tree::Statistics 
Calculate certain statistics for a Tree 
bioperl-1.0.2 - 16 Jul 2002 - Ewan Birney 





18. Device::ISDN::OCLM::Statistics 
OCLM statistics superclass 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 


19. Device::ISDN::OCLM::CurrentStatistics 
OCLM current call statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 


20. Device::ISDN::O0CLM::ISDNStatistics 
OCLM ISDN statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 








21. Device::ISDN::OCLM: :Lastl0Statistics 
OCLM Last10 call statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 





22. Device::ISDN::OCLM::LastStatistics 
OCLM last call statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 


23. Device::ISDN::OCLM::ManualStatistics 
OCLM manual call statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 








24. Device::ISDN::O0CLM::SPStatistics 
OCLM service provider statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 


25. Device::ISDN::OCLM::SystemStatistics 
OCLM system statistics 
Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes 














Let's check out the Statistics: :ChiSquare module. 


First, click on the link to Statistics::ChiSquare; you'll see a summary of the module, 
complete with a description, overview, discussion of the method, examples of use, and 
information about the author. 


One of the modules looks interesting; let's download and install it. How big is the source 
code? If you click on the source link, you'll find that the module is really just one short 
subroutine with the documentation defined right in the module. Here's the subroutine 
definition part of the module: 


package Statistics::ChiSquare; 


# ChiSquare.pm 
ih 
# Jon Orwant, orwant@media.mit.edu 


# 
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to fix an off-by-one error 





modify it under the same terms as Perl itself. 





Version 0.3. Module list status is "Rdpf" 


use strict; 
use vars gw(SVERSION @ISA @EXPORT) ; 














require Exporter; 
require AutoLoader; 

















@ISA = qw(Exporter AutoLoader) ; 

# Items to export into callers namespace by default. 
# names by default without a very good reason. Use 

# Do not simply export all your public functions/me 
@EXPORT = qw(chisquare) ; 

SVERSION = '0.3'; 

my @chilevels = (100, 99, 95, 90, 70, 50, 30, 10, 5 
my Schitable = ( ); 


31 Oct 95, revised Mon Oct 18 12:16:47 1999, and again November 2001 


Copyright 1995, 1999, 2001 Jon Orwant. All rights reserved. 
This program is free software; you can redistribute it and/or 


Note: do not export 


EXPORT OK instead. 





thods/constants. 


, A)? 


# assume th xpected probability distribution is uniform 








sub chisquare { 
my @data = @ ; 


@data = @{$data[0]} if @data = = 1 and ref ($data[0]); 
my Sdegrees of freedom = scalar(@data) - 1; 
my (Schisquare, $num_samples, Sexpected, $i) = (0, 0, 0, 0); 


if (! exists ($chitable{$degrees of freedom})) { 
return "I can't handle ", scalar(@data), 
" choices without a better table."; 
} 
foreach (@data) { Snum_samples += $_ } 
Sexpected = Snum_samples / scalar(@data) ; 
return "There's no data!" unless Sexpected; 
foreach (@data) { 


Schisquare += (($_ - Sexpected) ** 2) / Sexpected; 


} 
foreach (@{$chitable{$degrees of freedom}}) { 





if ($chisquare < $_) { 
return 
"There's a <Schilevels[$i+1]% and <Schilevels[$i]% chance that 
this data 
1s Kandom., "y 
} 
Sit? 


} 


return "There's a <Schilevels[S#chilevels]% chance that this data is 


random. "? 


} 


Schitable{1} = [0.00016, 0.0039, 0.016, 0.15, 0.46, 
Schitable{2} = [0.020, O20, Oc2le Oe tiy. Le3dy, 
Schitable{3} = [0.12, O23 Dy Oro6, beta, 2230) 
Schitable{4} = [0.30, Dist Ly LOGy. Baty Dadi, 
Schitable{5} = [0.55, 1.14, lv61ly 26000), 2239 
1209 | 3 

Schitable{6} = [0.87, 1.64, 2.20 p) D0 Sy De dDy 
168.1] 7 

Schitable{7} = [1.24, Zed dy 2683, BwOTy O25 





Ls Ol gp Zetl, 36-04, 62641] 4 

204; 4560, 5499, 9.21 )¢ 

S8Ol ge Os20- Te82,° Ll Say 
4.98%. belS, 92S, Ase2317 
C06; Oo2ay- DOr, 


Let sip LO s6Sy M39, 


S307. LAA, 14.07, 
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18.48]; 

Schitable{8} = [1.65, 2213; S249 “5.53, Ta2sdy BeSZ2, 13296, 15251, 
20.209] F 

Schitable{9} = [2.09, 2233; 4237, 6239; 8.34, 10.66, 14:68, 16.92, 
21267] # 

Schitable{10} = [2.56, 3.94, A286, Ts27;, 9.34, 11.78) 15.99, 18.31, 
23:.21 5 

Schitable{11} = [3.05, 4.58, 5.58, S215, 10.34, 12.90, 17.28, 19.68, 
24.73) 3 

Schitable{12} = [3.57, 5a2Sy 6.30, 9.03, 21.34, 14.01), 18.55, 21.03, 
26.22 3 

Schitable{13} = [4.11, 5.89, 7.04, 92.93, 12.34, 15.12, 19.81, 22.36, 
216917 

Schitable{14} = [4.66, 6257, Ted9e DOes2, 13534, -26.22;, 2Ls06,, 23.69; 
29.14); 

Schitable{15} = [5.23, 7s26>. 8255, DLL.72; 14234, 2.327 222.391,. 25.00, 
30.58]? 

Schitable{16} = [5.81, T=96;, OS: 8, 12262; 15.34, 18.42, 23.54, 26.30, 
32.00] 7 

Schitable{17} = [6-41], 8:67, 10.09, 13.53, 16.34, 19.51, 24.77, 27.59; 
33.41]; 

Schitable{18} = [7.00, 9.39, 10.87, 14.44, 17.34, 20.60, 25.99, 28.87, 
34.81]; 

Schitable{19} = [7.63, 10.12, 11.65, 15.35, 18.34, 21.69, 27.20, 30.14, 
36.19]; 

Schitable{20} = [@.26, 10.85, 12.44, 16.27, 19.34, 22.78, 28.41, 31.41, 
SI 2 STF 

1; 


Some of this code will look familiar; some may not. Check out the use of package, use 
strict, and require Exporter; they're parts of Perl you've just seen. 





You'll also see references to version, Autoloader, use vars, and an initialization of a 
multidimensional array chitable, which will be covered later. For now, you may want to take 
a quick read-through of the code and get some personal satisfaction at how much of it 
makes sense. 


Indeed, one of the really nice things about most modules is that you don't really have to read 
the code very often. Usually you can just install the module, read enough of the 
documentation to see how to call it from your program, and you're off and running. Let's 
take that approach now. 


1.9.3 Installing Modules Using CPAN.pm 


Our next task is to install the module using CPAN.pm. This section contains a log from when I 
installed Statistics: :ChiSquare on my Linux computer using CPAN.pm. 


In fact, to make things easy, here's the section of the CPAN FAQ that addresses installing 
modules: 


How do I install Perl modules? 


Installing a new module can be as simple as typing 





perl -MCPAN -e ‘install Chocolate::Belgian'. 


The CPAN.pm documentation has more complete instructions on how to use 

this convenient tool. If you are uncomfortable with having something 

take that much control over your software installation, or it otherwise 
doesn't work for you, the perlmodinstall documentation covers 

module installation for UNIX, Windows and Macintosh in more familiar terms. 
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Finally, if you're usin 
Manager) 
has much of the same functionality as CPAN.pm. 


g ActivePerl on Windows, the PPM (Perl Package 





The following is my install log. Notice that all I have to do is type a couple of lines, and 
everything else that follows is automatic! 


oing 





CPAN: 


assword: 
root@coltrane tisdall]# perl -MCPAN -e ‘install Statistics::ChiSquare' 
PAN: Storable loaded ok 
to read /root/.cp 
Database was generated on Wed, 
LWP::UserAgent lo 


Fetching with LWP: 


Going 


ftp://cpan.cse.msu. 





CPAN: Compress::Zlib lo 
Fetching with LWP: 





There's a new CPAN.pm version 


ftp: / /cpam. cse.msu « 
Going to read /root/. 
Database was generated on Mon, 


Permission denied at 


an/Metadata 


aded ok 


aded ok 


[Current version is v1.59 54] 
You might want to try 


install 








without quitting th 


reload cpan 


Bundle: :CPAN 





Fetching with LWP: 
ftp://cpan.cse.msu.ed 
Going to read /root/.cpan/sources/modules/03modlist.data.gz 
Going to write /root/.cpan/Metadata 
Running install for mod 
Running make for J/JO/JONO/Statistics-ChiSquare-0.3.tar.gz 
Fetching with LWP: 
ftp://cpan.cse.msu.ed 
CPAN: MD5 loaded ok 
Fetching with LWP: 
ftp://cpan.cse.msu.ed 


Checksum for 


feOooEs 


Scanning 





[tisdall@coltrane tisdall]$ perl -MCPAN -e ‘install Statistics::ChiSquare' 
CPAN: Storable loaded ok 

mkdir /root/.cpan: 

line 2218 

[tisdall@coltrane tisdall]$ su 
P 

[ 

G 

G 











20. Mar 2002 00239:29 GMT 


edu/authors/Olmailre.txt.gz 
to read /root/.cpan/sources/authors/Olmailre.txt.gz 


edu/modules/02packages.details.txt.gz 
cpan/sources/modules/02packages.details.txt.gz 
26 Aug 2002 00:22:07 GMT 


(v1.62) available! 


current session. It should be a seamless upgrade 
while we are running... 


ule Statistics 





u/modules/03modlist.data.gz 


::ChiSquare 


u/authors/id/J/JO/JONO/Statistics-ChiSquare-0.3.tar.gz 


u/authors/id/J/JO/JONO/CHECKSUMS 





.cpan/sources/authors/id/J/JO/JONO/Statistics-ChiSquare-0.3. 


taki 2 ik 
cache /root/.cpan/build for sizes 


Deleting from cache: /r 
Deleting from cache: /r 
Deleting from cache: /r 





Statist 
Statist 
Statist 
Statis 
Statis 
Statis 





tics/ChiSquare-0 
tics/ChiSquare-0 
tics/ChiSquare-0O. 
tics/ChiSquare-0O. 
tics/ChiSquare-0O. 
tics/ChiSquare-0O. 





oot/.cpan/build/I0O-stringy-2.108 (21.4>20.0 MB) 





oot/.cpan/build/XML-Node-0.11 (20.8>20.0 MB) 
oot/.cpan/build/bioperl-0.7.2 (20.7>20.0 MB) 





ioe 
.3/ChiSquare.pm 


3/Makefile.PL 
3/test.pl 
3/Changes 
3/MANIFEST 





Package seems to come without Makefile.PL. 
(The test -f "/root/.cpan/build/Sta 
Writing one on our own (setting NAM 


CPAN 


-pm: Going to bui 


ld J/JO/JONO/S 


tistics/Makefile.PL" returned false.) 


= 


EF to StatisticsChiSquare) 





tatistics-ChiSquare-0.3.tar.gz 


/usr/local/lib/perl15/5.6.1/CPAN.pm 


Page 41 


Checking if your kit is complete... 

Looks good 

Writing Makefile for Statistics: :ChiSquare 

Writing Makefile for StatisticsChiSquare 

make[1]: Entering directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
cp Ghd dauare...on ../blib/lib/Statistics/ChiSquare.pm 





AutoSplitting ../blib/lib/Statistics/ChiSquare.pm (../blib/lib/auto/ 
accmog perenne 
ene ome ./olib/man3/Statistics::ChiSquare.3 
make[1]: Leaving directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
fuse/ bin Hake -- OK 
Running make test 
make[1]: Entering directory *“/root/.cpan/build/Statistics/ChiSquare-0.3' 
make[1]: Leaving directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
make[1]: Entering directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
PERL DL _NONLAZY=1 /usr/bin/perl -I../blib/arch -I../blib/lib 
-I/usr/local/lib/ 
perl5/5.6.1/i686-linux -I/usr/local/lib/perl15/5.6.1 test.pl 

















Hh ss gw 
ok 1 
ok 2 
make[1]: Leaving directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
jase bin fie test -- OK 
Running make install 
make[1]: Entering directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
make[1]: Leaving directory */root/.cpan/build/Statistics/ChiSquare-0.3' 
Tested lang /usr/local/lib/perl5/site perl/5.6.1/Statistics/ChiSquare.pm 
Installing /usr/local/lib/perl5/site perl/5.6.1/auto/Statistics/ChiSquare/ 
autosplit.ix 
Installing /usr/local/man/man3/Statistics::ChiSquare.3 
Writing /usr/local/lib/perl5/site perl/5.6.1/i686-linux/auto/ 
StatisticsChiSquare/.packlist 
Appending installation info to 
/usr/local/lib/perl5/5.6.1/1686-linux/perllocal.pod 
/usr/bin/make install UNINST=1 -- OK 
[root@coltrane tisdall]# 

















This may seem like a confusing amount of output, but, again, all you have to do is type a 
couple of lines, and the installation follows automatically. 


You may get something like the following message when you try to install a CPAN module: 
[tisdall@coltrane tisdall]$ perl -MCPAN -e ‘install Statistics::ChiSquare' 
CPAN: Storable loaded ok 


mkdir /root/.cpan: Permission denied at /usr/local/lib/perl5/5.6.1/CPAN.pm 
line 2218 





As you can see, it didn't work, and it produced an error message. On Unix machines, it's often 
necessary to become root to install things.” In that case, use the Unix su command and try 
the CPAN command again: 


Zl You may need to contact your system administrator about getting root permission. The CPAN 


documentation discusses how to do a non-root installation. If you're not on a Unix or Linux machine and are 
using ActiveState's Perl on a Windows machine, for instance, you need to consult that documentation. 


[tisdall@coltrane tisdall]$ su 
Password: 
[root@coltrane tisdall]# perl -MCPAN -e ‘install Statistics::ChiSquare' 


Great, it worked. If you look over the rather verbose output, you'll see that it finds the 
module, installs it, tests it, and logs the installation. 


Pretty easy, huh? 


It's usually this easy, but not always. Occasionally, errors result, and the module may not be 
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installed. In that case, the error messages may be enough to explain the problem; for 
instance, the module may depend on another module you have to install first. Another 
problem is that some modules haven't been tested on, or even designed to work on, all 
operating systems; if you try to install a Windows-specific module on Linux, it is likely to 
complain. In extreme cases, the module documentation usually provides the author's email 
address. 


1.9.4 Using the Newly Installed CPAN Module 


Now comes the payoff. Let's look again at the documentation for the module and see if we 
can use it from our own Perl code. 


Now that the module is installed, you can see the documentation by typing: 
perldoc Statistics::ChiSquare 





You can also simply go back to the web documentation found at http://search.cpan.org. 
Either way, you'll find the following example using this ChiSquare module: 








NAME 
"Statistics::ChiSquare" - How random is your data? 
SYNOPSIS 
use Statistics::Chisquare; 
print chisquare(@array of numbers); 
Statistics::ChiSquare is available at a CPAN site near 
you. 
DESCRIPTION 
Suppose you flip a coin 100 times, and it turns up heads 
70 times. Is the coin fair? 


Suppose you roll a die 100 times, and it shows 30 sixes. 
Is the die loaded? 

In statistics, the chi-square test calculates "how random" 
a series of numbers is. But it doesn't simply say "yes" 
or "no". Instead, it gives you a confidence interval, 
which sets upper and lower bounds on the likelihood that 
the variation in your data is due to chance. See the 
examples below. 


The documentation continues with more discussion and some concrete examples that use 
the module and interpret the results. 


Very often, the SYNOPSIS part of the documentation is all you need to look at. It shows you 
specific examples of how to call the code in the module. In this case, because it's a very 
simple module, there is just one subroutine that can be used. As you see from the 
documentation excerpt, you just need to pass the chisquare subroutine an array of numbers 
and print out the return value to use the code. Let's try it. We'll take as our input an array of 
numbers that corresponds to the stops of the Broadway-7th Avenue local subway train on 
the west side of Manhattan, from 14th Street up to 137th Street in Harlem. (We'll assume 
you didn't run fast enough and missed the A train.) Let's see how random these stops really 
are: 

use strict; 

use warnings; 


use Statistics: :ChiSquare; 

my(@subwaystops) = (14, 18, 23, 28, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 
110, 

116, 125; 137)¥ 


print chisquare (@subwaystops) ; 
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This produces the output: 


There's a <1% chance that this data is random. 


(Knowing firsthand the feelings of long-suffering New York City Subway riders, I predict that 
this result might provoke some spirited discussion. Nevertheless, we seem to have working 
code. ) 


1.9.5 Problems with CPAN Modules 


Actually, the sharp-eyed reader may have noticed a problem in our mad dash uptown. In the 
first line of the SYNOPSIS section, there's the following: 


use Statistics: :Chisquare; 


The name of the module is spelled Chisquare, whereas in all other places in the 
documentation the module is spelled ChiSquare with a capital S. In Perl, the case of a letter, 
uppercase or lowercase, is important, and this looks suspiciously like a typographical error in 
the documentation. If you try use Statistics::Chisquare, you'll discover that the module 
can't be found, whereas if you try use Statistics::ChiSquare, the module is there. This is 
a minor bug, but some modules have poor documentation, and it can be a time-consuming 
problem, especially if you are forced to wade into the module code or try various tests, to 
figure out how the module works. 


Apart from bugs, I've also mentioned the problem that some modules are not tested, or 
designed, for all operating systems. In addition, many modules require other modules to be 
present. It's possible to configure CPAN to automatically install all the required modules a 
requested module uses, as described in the CPAN documentation, but you may need to 
intervene personally. It's useful to remember that if you have a program that uses a certain 
module running on one computer, and you move the program to another computer, you 
may have to install the required modules on the new computer as well. 


Saving the worst for last, it's also important to remember that contributing to CPAN is open 
to one and all, and not all the code there is well-written or well-tested. The heavily used 
modules are, but counterexamples can be found. So, don't bet the farm on your code just 
because it uses a CPAN module; you should still carefully read the documentation for the 
module and test your program. 


The CPAN FAQ explains in detail the way to be a good citizen when it comes to testing and 
reporting bugs that you discover in CPAN code. 


[ 
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m 
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1.10 Exercises 


Exercise 1.1 


What are the problems that might arise when dividing program code into separate 
module files? 


Exercise 1.2 
What are the differences between libraries, modules, packages, and namespaces? 
Exercise 1.3 
Write a module that finds modules on your computer. 
Exercise 1.4 
Where do the standard Perl distribution modules live on your computer? 
Exercise 1.5 
Research how Perl manages its namespaces. 
Exercise 1.6 


When might it be necessary to export names from a module? When might it be useful? 
When might it be convenient? When might it be a very bad idea? 


Exercise 1.7 


The program testGeneticcode contains the following loop: 


# Translate each three-base codon to an amino acid, and append to a 
protein 
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) { 

Sprotein .= Geneticcode::codon2aa( substr($dna,$i,3) ); 


} 


Here's another way to accomplish that loop: 


# Translate each three-base codon to an amino acid, and append to a 


protein 

my Si=0; 

while (my Scodon = substr($dna, $i += 3, 3) ) { 
Sprotein .= Geneticcode::codon2aa( $codon ); 


} 


Compare the two methods. Which is easier to understand? Which is easier to 
maintain? Which is faster? Why? 


Exercise 1.8 
The subroutine codon2aa causes the entire program to halt when it encounters a 
"bad" codon in the data. Often (usually) it is best for a subroutine to return some 
indication that it encountered a problem and let the calling program decide how to 


handle it. It makes the subroutine more generally useful if it isn't always halting the 
program (although that is what you want to do sometimes). 


Rewrite codon2aa and the calling program testGeneticcode so that the subroutine 
returns some errorperhaps the value undefand the calling program checks for that 
error and performs some action. 


Exercise 1.9 


Write a separate module for each of the following: reading a file, extracting FASTA 
sequence data, and printing sequence data to the screen. 


Exercise 1.10 


Download, install, and use a module from CPAN. 


4 PREVIOUS 


Page 45 


[ Team LiB ] 


Page 46 


[Team LiB J 


Chapter 2. Data Structures and String 
Algorithms 


So far in this book, I've used the standard Perl data structures of scalars, arrays, and hashes. 
However, it is often necessary to handle data with a more complex structure than what those 
basics allow. For instance, it is frequently useful to have a two-dimensional array. 


In this chapter, you'll learn how to define and use references and complex data structures. 
After you learn the fundamentals, you'll apply the new techniques to implement a biologically 
important algorithm. These techniques are also fundamental to the implementation of 
object-oriented programming, as you'll see in Chapter 3. 


The algorithm we'll study is called approximate string matching. It lets you find the closest 
match for a peptide fragment in a protein, for instance. It uses an algorithmic technique called 
dynamic programming, an essential tool for many similar biological tasks, such as aligning 
biological sequences. In this chapter, you'll see how Perl references can be used to write 
programs for data problems with more complex relationships. References are also used for 


the objects of object-oriented programming. 
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2.1 Basic Perl Data Types 


Before tackling references, let's review the basic Perl data types: 
Scalar 


A scalar value is a string or any one of several kinds of numbers such as integers, 
floating-point (decimal) numbers, or numbers in scientific notation such as 2.3E23.A 
scalar variable begins with the dollar sign $, as in Sdna. 


Array 


An array is an ordered collection of scalar values. An array variable begins with an at 
sign @, aS in @peptides. An array can be initialized by a list such as @peptides = 
('zeroth', 'first', 'second'). Individual scalar elements of an array are referred to 
by first preceding the array name with a dollar sign (an individual element of an array is 
a scalar value) and then following the array name with the position of the desired 
element in square brackets. Thus the first element of the @peptides array is 
referenced by $peptides[0] and has the value 'zeroth'. (Note that array elements 
are given the positions O, 1, 2, ..., n-1, where n is the number of elements in the 
array.) 


Recall that printing an array within double quotes causes the elements to be separated 
by spaces; without the double quotes, the elements are printed one after the other 
without separations. This snippet: 


@pentamers = ('cggca', 'tgatc', 'ttggc'); 


print "@pentamers", "\n"; 
print @pentamers, "\n"; 


produces the output: 
eggca tgate ttgge 


cggcatgatcttggc 
Hash 


A hash is an unordered collection of key value pairs of scalar values. Each scalar key is 
associated with a scalar value. A hash variable begins with the percent sign %, as in 
Sgeneticmarkers. A hash can be initialized like an array, except that each pair of 
scalars are taken as a key with its value, as in: 


The => symbol is just a synonym for a comma that makes it easier to see the 
key/value pairs in such lists.“ An individual scalar value is retrieved by preceding the 
hash name with a dollar sign (an individual value is a scalar value) and following the 
hash name with the key in curly braces, as in $geneticmarkers{'hairless'}, which, 
because of how it's initialized, has the value 'no'. 


"1 Tt also forces the left side to be interpreted as a string. 
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2.2 References 


Many computer languages provide variables that allow you to refer to, or point at, other 
values. So, instead of a variable containing data such as a string or number of interest, the 
variable contains the /ocation of the data; it tells you where to go to get the value you want. 
In Perl, the use of a scalar variable to refer to another value is called a reference, and the 
value being pointed at is called a referent. 


References allow you to do many useful things in Perl; you can define multidimensional arrays 
and other more complex data structures and avoid copying large amounts of data (for 
instance, when passing arguments into subroutines). Using references can make your 
programs faster, more efficient, and shorter. References have a number of uses, as you'll see 
in the next sections. 


2.2.1 References to Scalars 


Here's an example of a reference: 
Speptide = 'EIQADEVRL'; 

















Speptideref = \S$peptide; 





print "Here is what's in the reference:\n"; 

print Speptideref, "\n"; 

print "Here is what the reference is pointing to:\n"; 
print ${Speptideref}, "\n"; 

print $Speptideref, "\n"; 


This Perl code produces the following output: 


Here is what's in the reference: 

SCALAR (0x80 fe4ac) 

Here is what the reference is pointing to: 
FE ITQADEVRL 
EF TQADEVRL 























What's going on here? 











First, a string value of ETOADEVRL is assigned to the scalar variable S$peptide. Next, a 
backslash operator is used before the $peptide variable to return a reference to the variable. 
This reference is saved in the scalar variable $peptideref., 





The next lines of code show what this example really does. When you print out the (actual) 
value of the reference variable Speptideref, you get the value: 


SCALAR (0x80fe4ac) 


This says that the reference variable Speptideref is pointing to a scalar value (which is the 
value of the scalar variable $peptide). It also gives a hexadecimal number that specifies 
where in the computer memory the value for that variable resides. 


The 0x at the beginning of the number says that it is a hexadecimal number. Hexadecimal 
(base 16) numbers are a way to specify locations in computer memory. The exact location in 
the computer memory where this $peptide value resides is almost never of practical 
importance to you. However, it can help when debugging code that uses references, and so it 
is displayed when you print the value of a reference as we've just done or when you use the 
Perl ref command (which we'll use later). 


'] Recall that hexadecimal numbers use 16 digits, from 0 to f, and that the decimal (base 10) numbers: 
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2.2.1.1 Dereferencing 


Finally, our code fragment performs the essential task of dereferencing a reference. In Perl a 
reference to a scalar variable can be dereferenced by surrounding it with curly braces {} and 
prepending another dollar sign to it. ${Speptideref} returns the value the reference variable 
is pointing at. The value being pointed at is the same as the value of the S$peptide variable, 
which has the value 'EIQADEVRL', So ${Speptideref} also has the value 'EIQADEVRL'. 


























Surrounding a reference with curly braces before prepending the appropriate symbol ($ for 
scalar, @ for array, % for hash) is generally the best way to dereference reference variables. As 
you start using more intricate references, you'll find that it's often the only way to 
dereference properly. However, for simple reference variables, it is possible to omit the 
additional curly braces. So, our example shows both ways of dereferencing our scalar 
reference: 

S{Speptideref } 

SSpeptideref 


In Perl, every reference must be dereferenced properly in the program (in other words, by the 
programmer) to be useful. Perl doesn't automatically dereference for you, nor can it figure 
out when you want a reference or when you want the value that the reference is pointing to. 
So, it's up to you to specify that you want the value of a reference by prepending a %, @, or $ 
to hash, array, or scalar references, respectively. (And, as just pointed out, you often need to 
surround the reference with curly braces, although for simple references, they can be 
omitted.) 


2.2.1.2 Anonymous data 


A scalar constant can also be referenced, as in the following code: 

















Speptideref = \'EIQADEVRL'; 

print "Here is what's in the reference: \n"; 

print Speptideref, "\n"; 

print "Here is what the reference is pointing to:\n"; 
print ${Speptideref}, "\n"; 


This produces the output: 

Here is what's in the reference: 

SCALAR (0x80 fe4a0) 

Here is what the reference is pointing to: 
EFIQADEVRL 








In this case the reference points directly to a location in memory in which the string value 
EIQADEVRL is being stored. 

















Compare this code with the previous example. The reference was to an existing variable that 
held a scalar value. Think of it as a scalar value with a "name" that is the already existing 
variable. Now, the reference is to a scalar value alone. This scalar value isn't contained in any 
variable; it has no name. Thus, it's called an anonymous referent, which can only be used via 
the reference to it. 


You may well ask, "Why bother?" Anonymous scalars are, for most practical purposes, not 
any more desirable than simple scalar variables. However, anonymous data structures, and 
references to them, are frequently useful, as you shall see. 


2.2.2 References of References 


It is sometimes useful to have references of references. Since a reference is just a variable 
containing a scalar value, it's possible to make a reference to a reference: 


Svalue = 'ACGAAGCT'; 
Srefvalue = \Svalue; 
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Srefrefvalue = \Srefvalue; 





print $value, "\n"; 
print $S$refvalue, "\n"; 
print $S$Srefrefvalue, "\n"; 


This prints out: 


ACGAAGCT 
ACGAAGCT 
ACGAAGCT 


(Notice that here I've omitted the surrounding curly braces from around the references.) You 
can also apply several levels of reference at one go: 


Svalue = 'ACGAAGCT'; 
Srefrefrefvalue = \\$value; 





print $value, "\n"; 
print $$SSrefrefrefvalue, "\n"; 





This prints out: 


ACGAAGCT 
ACGAAGCT 


2.2.3 References to Arrays 


References to arrays obey pretty much the same syntax as references to scalars. You make 
a reference to an array by prepending a backslash to the @ sign; you dereference the array by 
surrounding the reference variable with curly braces and prepending an @ sign, as in the 
following example: 











@pentamers = ('cggca', 'tgatc', 'ttggc'); 

Sarrayref = \@pentamers; 

print "Here is what's in the reference:\n"; 

print Sarrayref, "\n"; 

print "Here is what the reference is pointing to:\n"; 
print "@{Sarrayref}\n"; 

print "Here is the second value in the array:\n"; 
print ${Sarrayref}[1], "\n"; 





This Perl code produces the following output: 


Here is what's in the reference: 

ARRAY (0x80fe4c4) 

Here is what the reference is pointing to: 
cggca tgate ttgge 

Here is the second value in the array: 
tGace 





An important point to remember here is that Perl doesn't automatically know if you want the 
data a reference is pointing to or the reference variable itself. If it's pointing to a scalar value, 
it's up to you to prepend the $ sign to the reference in order to dereference the value. 
Similarly, as in this example, if a reference is pointing to an array value, it's up to you to 
prepend the @ sign to the reference to dereference the value; @{Sarrayref} is correct; 
@Sarrayref is also okay. 


On the other hand, if you want the value of one element of the referenced array, you prepend 
a dollar sign because the value will be a scalar value. Recall that to get the scalar value of one 
element of an array you use a dollar sign; for example, $array[0]. Similarly, to get the scalar 
value of one element from an array with a reference, you prepend a dollar sign; for example, 

S{Sarrayref}[0] or $Sarrayref[0]. 
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2.2.3.1 The arrow operator 


References to arrays (and references to hashes and subroutines) can be dereferenced using 
another syntax that's popular and important to learn. If Sarrayref is a reference to an array, 
then to dereference the second element (for instance) of that array, you can say either: 


SSarrayref [1] 

or, equivalently: 

Sarrayref->[1] 

The following code fragment shows this: 
@pentamers = ('cggca', 'tgatc', 'ttggc'); 





Sarrayref = \@pentamers; 


print "Here is the second element of the pentamers array:\n"; 
print $Sarrayref[1i], "\n"; 





print "And here it is again:\n"; 
print Sarrayref->[1], "\n"; 


This code prints out: 


Here is the second element of the pentamers array: 
tgatec 

And here it is again: 

bgate 





The arrow operator appears between the name of the reference to an array and the square 


brackets and subscript. It works similarly with hashes and with subroutines, as you'll see later. 


As a convenient shortcut, it is sometimes possible to drop multiple arrow operators in a 
reference. Thus, if: 


Sarray = [ [ 'Dennis', 'Drayna' ], [ 'Callum', 'Bell' ] ]; 
the following are synonymous: 


print S$Sarray[1][2]; 
print Sarray->[1] [2]; 
print Sarray->[1]->[2]; 





Here's the output: 
BellBellBell 





I'll show more examples of this shortcut later in this chapter. 
2.2.3.2 Anonymous arrays 


You can create an anonymous array by surrounding a list with square brackets. (A mnemonic 
device to remember this bit of syntax is that square brackets are also used with arrays to 
refer to a particular element, as in $arr[4].) You can then create a reference to the 
anonymous array like so: 


Spentamers = ['cggca', 'tgatc', 'ttggc']; 


print "The third and last element of the array is ", Spentamers->[2], "\n"; 
This gives the output: 
The third and last element of the array is ttggc 


In this case, Spentamers is a reference to an (anonymous) array. The third element can 
equally well be printed using $Spentamers [2]. The entire array is named by prepending an @ 
sign: 

Spentamers = ['cggca', 'tgatc', 'ttggc']; 


print "The third and last element of the array is S$Spentamers[2]\n"; 
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print "The entire array is: @Spentamers\n"; 


This produces the output: 


The third and last element of the array is ttggc 
The entire array is: cggca tgatc ttggc 


2.2.4 References to Hashes 


References to hashes also follow the same rules as references to scalars and arrays. You 
make a reference to a hash by prepending a backslash to the % sign; you dereference by 
prepending the percent sign to the dollar sign on the reference variable: 


sgeneticmarkers = ('curly' => 'yes', ‘hairy' => 'no', 'topiary' => 'yes'); 
Shashref = \%geneticmarkers; 


print "Here is what's in the reference:\n"; 
print Shashref, "\n"? 


print "Here is what the reference is pointing to:\n"; 
foreach $k (keys %Shashref) { 

print "key\tS$k\t\tvalue\tS$hashref{$k}\n"; 
} 


print "Dereferencing using the arrow operator:\n"; 
foreach $k (keys %Shashref) { 

print "key\t$k\t\tvalue\tShashref->{$k}\n"; 
} 


This Perl code produces the following output: 


Here is what's in the reference: 
HASH (0x80fe4c4) 
Here is what the reference is pointing to: 





key topiary value yes 
key curly value yes 
key hairy value no 
Dereferencing using the arrow operator: 

key topiary value yes 
key curly value yes 
key hairy value no 





Notice that the keys are printed in a different order than they were specified: hashes do not 
preserve the order of their keys. (Also recall that, in a double quoted string, \t prints a tab 
space. ) 


If you want one value of the referenced hash, you prepend a dollar sign because the value will 
be a scalar value. To get one value of a hash, you use a dollar sign, e.g., 
$geneticmarkers{'curly'}; to get one value from a reference to a hash, you also use a 
dollar sign, é.g., $Shashref{'curly'}. 


The arrow operator -> works with hashes the way it works with arrays. With hashes, the 
arrow operator is placed between the name of the hash and the curly braces. To illustrate: 


$geneticmarkers = ('curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes'); 
Shashref = \%geneticmarkers; 

print "For key 'curly' the value is '", S$Shashref{'curly'}, "'\n"; 

print "For key 'curly' the value is '", Shashref->{'curly'}, "'\n"; 

This prints: 

For key 'curly' the value is 'yes' 

For key 'curly' the value is 'yes' 





2.2.4.1 Anonymous hashes 
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You can create an anonymous hash by surrounding a list with curly braces. (The mnemonic 
device to remember this bit of syntax is that curly braces are also used with hashes to refer 
to a particular key, as in Shash{'curly'}.) You can then create a reference to the 
anonymous hash like so: 


Sgeneticmarkers = {'curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes'"}; 
print "Here is what is in the anonymous hash:\n"; 
foreach $k (keys %Sgeneticmarkers) { 


print "key\t$k\tvalue\t$geneticmarkers->{S$k}\n"; 
} 


This gives the output: 


Here is what is in the anonymous hash: 


key topiary value yes 
key curly value yes 
key hairy value no 


In this case, Sgeneticmarkers is a reference to an (anonymous) hash. The values can 
equally well be printed using $$geneticmarkers{$k}or $geneticmarkers->{$k}. 


Curly braces can also be used for blocks and for subroutine definitions. The Perl interpreter 
can occasionally get confused as to which of these constructs is meant, although it's rare. To 
be clear, you can put a plus sign + in front of an anonymous hash to specify that it is an 
anonymous hash and not a block: 

Sanonhash = +{ one’ => 1, "two' => 2 }> 

print "The old SSanonhash{'one'} Sanonhash->{'two'}\n"; 


This prints: 
The old 1 2 


2.2.5 References to Subroutines 


References to subroutines are yet another way to reference in Perl. This may seem a little 
odd. References to scalars, arrays, and hashes are references to data structures. But 
references to subroutines? A subroutine isn't a data structure, so how did this come about? 


There are two reasons why references to subroutines make sense the same way that 
references to data structures make sense. The first reason is that just as variables are 
managed with Perl's symbol tables, so also are subroutine definitions managed by the symbol 
table. In Chapter 3, you'll see the deliberate manipulation of a symbol table to make 
subroutine definitions on the fly. In this sense, subroutines, hashes, arrays, and scalars all 
refer to data that has a name. 


The second reason is that references to subroutines are sometimes a great tool to use when 
writing a program. There are times when you might apply one of a number of different 
subroutines depending on the program logic and the input, and using references to 
subroutines can make this kind of code easier to write. That's the real justification for just 
about everything you might find in the toolbox that we call a programming language, right? (I 
admit it sometimes seems that sheer orneriness was the motivation.) 


References to subroutines follow the same rules as references to scalars and arrays. Recall 
that a subroutine name may optionally be prepended with the ampersand sign & when it is 
called.“! Thus, these two are equivalent: 


3 . - . 
&) The ampersand was required in older versions of Perl. 


findmotif ("ATTAATTTTCCGATC') ; 

&findmotif ('ATTAATTTTCCGATC'); 

To make a reference to a subroutine, you prepend a backslash to the ampersand: 
Ssubref = \&findmotif; 
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You dereference a subroutine one of two ways: by prepending an ampersand to the 
subroutine reference, like so: 


&Ssubref( ); 

or by using the arrow operator, like so: 

Ssubref->( ); 

This is demonstrated by the following code fragment (which includes a subroutine definition): 
print "Mark 1:\n"; 


findmotif ('ATTAATTTTCCGATC'); 


print "Mark 23\n"; 
&findmotif ("'ATTAATTTTCCGATC'); 











print "Mark 3:\n"; 
Ssubref = \&findmotif; 
&Ssubref ('ATTAATTTTCCGATC'); 


print "Mark 4:\n"; 
Ssubref = \&findmotif; 
Ssubref->('ATTAATTTTCCGATC'); 


print “Mark S<\n" 
Ssubref2 = \findmotif; 
&Ssubref2 ('ATTAATTTTCCGATC') ; 


sub findmotif { 





my(Sinput) = @ ; 
if (Sinput =~ /CCGA/) { 

print "I found CCGA! \n"; 
}else{ 

print "No motif\n"; 


} 
} 


This produces the output: 


Mark 1: 

I found CCGA! 

Mark 2: 

I found CCGA! 

Mark 3: 

I found CCGA! 

Mark 4: 

I found CCGA! 

Mark 5: 

Not a CODE reference at - line 17. 

















This code defines a little subroutine findmotif that looks for a short motif in DNA sequence 
data. The first two calls to the subroutine simply demonstrate that you can call subroutines 
with or without a leading ampersand «. The third calls the subroutine by means of a reference 
to the subroutine, as just described. The fourth call is by means of a reference to the 
subroutine using the alternative arrow operator. Finally, the fifth call produces an error; the 
problem is just a syntactical one; it tries to take a reference to a subroutine by prepending 
the backslash to the name of the subroutine without including the leading ampersand. 


It's useful to remind the gentle reader that the error produced by that fifth call to findmotif 
occurs only if you don't use the use strict directive (as you are encouraged always to do). 
Without use strict, the program fails only when it reaches that bad call. With use strict, 
the program complains and fails immediately. What if that call isn't made until several hours 
into the running program (which is not an uncommon running time in bioinformatics)? use 
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strict can save a lot of time and effort. 


2.2.5.1 Anonymous subroutines 


You can create an anonymous subroutine by giving the keyword sub followed by a subroutine 


definition within the usual curly braces, followed by a semicolon. An anonymous subroutine 
definition is just like a normal subroutine definition, except the name of the subroutine is 
omitted, and you must follow it with a semicolon. (Recall that subroutine definitions normally 
are not followed by a semicolon, as with the subroutine findmotif in the previous example.) 


You can create a reference to the anonymous subroutine like so: 
Sfindmotif = sub { 





my(Sinput) = @ ; 
if(Sinput =~ /CCGA/) { 

print "I found CCGA! \n"; 
}else{ 

print "No motif\n"; 


} 
1? 


Sfindmotif->('ATTAATTTTCCGATC'); 
&$findmotif ('ATTAATTTTCCGATC') ; 


This gives the output: 
I found CCGA! 
I found CCGA! 


In this case, $findmotif is a reference to an (anonymous) subroutine. The subroutine 
reference was dereferenced and called twice to show the use of the two alternative choices 
of syntax: the prepended ampersand and the arrow operator. 


2.2.5.2 Passing references to subroutines 


Perl collapses all arguments to a subroutine as a list of scalars. This makes it impossible to 
distinguish between, say, two arrays you might try to pass to a subroutine, as the following 
example illustrates: 

@Gaminoacidsl = ("H", "VW", "h')? 

GCaminoacrzds?. = (rp, MEV, YY") 





printacids(@aminoacidsl, @aminoacids2) ; 


sub printacids { 





my(@aal, @aa2) = @ ; 
print "Amino acids 1\n"; 
print "@aal\n"; 

print "Amino acids 2\n"; 
print "@aa2\n"; 


} 
This prints out: 


Amino acids 1 
h Vb DT ¥ 
Amino acids 2 





As you can see, the elements of both arrays are passed to the subroutine by means of the 
special array @_, and Perl assigns this entire array to the first local array @aal. 


In order to pass an arbitrary list of any combination of scalars, arrays, or hashes to a 
subroutine, it's necessary to pass the values as references. Here's how to fix the previous 
example: 
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@aminoacidsl 


(ge ae yy i A 
@aminoacids2 = (' 


Dp; EY ye 





printacids(\@aminoacidsl, \@aminoacids2) ; 


sub printacids { 





my(S$aal, S$aa2) = @ ; 
print "Amino acids 1\n"; 
print "@S$aal\n"; 

print "Amino acids 2\n"; 
print "@Saa2\n"; 


} 
This prints out: 


Amino acids 1 
EV L 
Amino acids 2 
TE oY 





In this version, the subroutine is passed references to the arrays. Inside the subroutine, the 
references are collected in the variables $aal and $aa2 and are dereferenced to print out their 
contents using the forms @Saal and @$aaz2. 


Even when you're passing just one scalar to a subroutine, you might want to pass a 
reference. Say you have the DNA sequence of human chromosome 1 in a variable $chroml. 
You want to pass this sequence into a subroutine that searches for restriction enzymes. A 
problem can arise because passing a variable into a subroutine involves making a copy of the 
data into the subroutine's variables, and you've just used up a significant portion of your 
computer's memory. 


By passing a reference to the DNA sequence data, you avoid making a copy of the data, and 
your program will use less memory. It will also run much faster because copying large strings 
is a fairly time-consuming process for a program. 

Here's a simple example of how to pass a scalar reference to a subroutine: 


my Schroml = getchrom('l'); # assume we read in human chromosome 1 here 





my @enzyme sites = findrestrictionenzymes(\$chroml, 'HindIII'); 


sub findrestrictionenzymes { 
my(S$segref, Sre) = @ ; # Sseqref is a reference to a scalar string 
# Sre contains the name of a restriction enzyme 





. program logic follows, where $$seqref is the sequence data 


} 


Writing programs is a type of engineering, and engineering always seems to come back to the 
idea of tradeoffs. The downside of passing references to subroutines is that anything the 
subroutine does to the referenced data stays in effect after the subroutine has exited. This 
"action at a distance" needs to be treated with care, so as not to modify data unintentionally. 


2.2.5.3 Returning references from subroutines 


You'll see in Chapter 3 how the subroutine called new returns a reference to an anonymous 
data structure declared within the subroutine. Until then, I'll defer a detailed discussion of how 
this works; the bottom line is that a subroutine can return a reference because a reference is 
"really" just a scalar value. 


2.2.6 Symbolic Versus Hard References 


There are two kinds of references, hard and symbolic. Hard references actually point to 
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locations in computer memory. 


For example, a hard reference to a scalar: 
Sname = 'Joel'; 
is defined like so: 


Snameref = \Sname; 


and the values associated with the hard reference $Snameref are: 


print 'Snameref has the value ', Snameref, ' and points to the referent ', 
SSnameref, "\n"; 
This prints: 


Snameref has the value SCALAR(0x80fe4ac) and points to the referent Joel 





Symbolic references refer to a name, not an address. As a brief example, let's say we have 
four array variables @mark1, @mark2, @mark3, and @mark4. It is possible to have another 
variable that is set to one of these variable names; let's say the variable is called Sarrayname 
and it's set to the value mark3, and that is the array we want to access. 


You can place the $arrayname variable in a block. Because a block returns the value of its last 
expression, this block returns the string mark3. You can then place the special array symbol @ 
in front of the block, and Perl will recognize this as meaning the @mark3 array. Here is a 
demonstration of how this works: 


@markl = ( 'al', ‘a2', 'a3', 'a4' ); 
Cmarke Si MNol*,;. “p2"% "ba", "ba" je 
Cmark3 =) i "er. Ver. eat; Nea" je 
(marks = ( ‘Ol, "a2" "dB", "da" js 
Sarrayname = 'mark3'; 


print "@{Sarrayname}\n"; 


This program prints out the result: 

el e2 63 24 

Symbolic references are avoided by some programmers and used frequently by others; you 
may sometimes come across them, or even find yourself using them. They are used in the 
AUTOLOAD methods that install methods at runtime, which you'll learn about in the later 
chapters. 
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2.3 Matrices 


Perl matrices are built from simpler data structures using references. Recall that a matrix is a 
set of values that can be uniquely referenced by indexes. If only one index is required, the 
matrix is one-dimensional (this is exactly how an array works in Perl). If n indexes are 
required, the matrix is n-dimensional. 


2.3.1 Two-Dimensional Matrices 


A two-dimensional matrix is one of the simplest complex data structures. It can be 
conceptualized as a table of rows and columns, in which each element of the table is uniquely 
identified by its particular row and column. 


There are several ways to build matrices in Perl. We'll look at some of the most useful. 


Because there is no built-in matrix data structure, you have to build a matrix from other data 
structures. The most straightforward way to do this is with an array of arrays: 


@probes = ( 
[ile te. 2-2 lly 
Lay 20. By. Ls 
Sip By Gy. Ty 
[ty By. yy 2] 


)3 


print "The probe at row 1, column 2 has value ", Sprobes[1][2], "\n"; 


This prints out: 


The probe at row 1, column 2 has value 8 





Recall that in Perl the first element of an array is indexed 0; so row 1 in 

we this program is actually the second row, and column 2 is actually the third 
column. Sometimes you may want to refer to the Oth row as row 1; you 
have to adjust your code and your interactions with the user accordingly. 





This matrix is implemented as an array (in parentheses), each element of which is a reference 
to an anonymous array [in square brackets], which itself is a list of integers. 


Another good way to build an array is to declare a reference to an anonymous array. In the 
following example, I declare an empty anonymous array and then populate it as desired. This 
is, in effect, an anonymous array of anonymous arrays: 





# Declare reference to (empty) anonymous array 
Sarray = [ ]; 


# Initialize the array 
for (Si=09 Sa. < Ay eS a) { 
for($j=0; $3 < 4 ; ++$3) { 
Sarray->[Si] [$j] = $i * $j; 
} 
} 


# Reset one of the elements of the array 
Sarray->[3][2] = 99; 


# Print the array 
for(Si=0; Si. < 4 7 4451) { 
for ($j=0; $3 < 4 ; ++$4) { 
pEiNtE("S30 ", Sarray=>([$2L] [$5]); 
} 
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pint: ' \ial's 


Note the use of printf to format the output nicely. For a refresher on 
_ this Perl function, consult the Perl documentation, by typing: 


perldoc -f printf 

and 

perldoc -f sprintf 

at a shell prompt or check out http://www.perldoc.com. 





This program produces the following output: 


0 0 0 0 
0 i Z 3 
0 2 4 6 
0 3 OS o 


Alternatively, if the values are known, I can declare this as an anonymous array of 
anonymous arrays by saying: 


Sarray = [ 
[0, 0, O, QO], 
[0, l, 2, 3], 
[0, 2, 4, 6], 
[0, 3, 99, 9] 


l3 
I can also declare an array of anonymous arrays, by saying: 


@array = ( 
POs. Dig Dy 2] 4 
Oy. tia Ze Sy 
[Oy 2x Ae Oly 
3, 99) -9 


[0, 
)3 


Notice the slight syntactical difference between an array of anonymous arrays: 
Garray = ( [ Je Db Te wea DZ 

and an anonymous array of anonymous arrays: 

parray = [ol dy LL de «ss Te 

Note that Perl also allows you to say: 

$Sarray[$i] [$j] 
as a synonym for: 
Sarray->[$i] [$j] 
But beware confusing: 
Sarray->[$i] [$4] 
with: 
Sarray[$i] [$3] 
They are not the same thing and won't refer to the same array if you intermix them! 





Very often you read data in from a file that has the elements of a matrix displayed one row 
per line, and you have to store the data from that file in an array in your Perl program. Say 


you have the following data: 


0 0 0 0 
2 3 
4 6 

2) 


0 
0 
0 99 


GS: So 
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You can read the data into a Perl array with the following loop: 
while (<>) { 

@row = split; 

push(@array, [ @row ]); 


} 


This assumes that you've named the file on the command line as an argument to the 
program. Note that each incoming line is assigned to the special variable $_ on each iteration 
through the while loop. The split function uses this line stored in $_ by default. Each 
incoming line is split into an array of its whitespace-separated elements, and then an 
anonymous array [ @row ] containing those elements is pushed onto the @array array. 


For more details on arrays of arrays, see the perllol manpage; type perldoc perllol at 
your command prompt or visit the Perl documentation web site at http://www.perldoc.com. 





2.3.2 Higher-Dimensional Matrices 


To use a higher-dimensional matrix, simply add another dimension: 


# Populate a 3-dimensional array 
Sarray = [ ]; 


# Initialize the array 
for (Si=0F (Sa <4 2 Hoa). 4 
for($j=0; $3 < 4 ; ++$4) { 
for (Sk=0) Sk < @ } 48k) { 
Sarray->[Si] [$j] [$k] = $i * $j * k; 
} 


} 


The sharp-witted reader may have noticed that we seem to be omitting arrow operators 
between array subscripts. (After all, these are anonymous arrays of anonymous arrays of 
anonymous arrays, etc., so shouldn't they be written [Sarray->[$i]->[$j]->[$k]?) Perl 


allows this; only the arrow operator between the variable name and the first array subscript is 


required. It make things easier on the eyes and helps avoid carpal tunnel syndrome. On the 
other hand, you may prefer to keep the dereferencing arrows in place, to make it clear you 
are dealing with references. Your choice. 


There's no need to stop at three-dimensional arrays. If higher-dimensional arrays are hard to 
imagine, just don't think of "dimension" as tied to space. For instance, four- dimensional 
arrays have points that are uniquely identified by four indices; five- dimensional arrays have 
points that are uniquely identified by five indices, etc. In fact, subatomic space is thought to 
contain eleven dimensions. 


2.3.3 Sparse Arrays 


Some programs need arrays, but only a small number of the array elements are ever used. 
Such arrays are called sparse arrays. 


It would be inefficient to declare, for instance, a 1,000-by-1,000 element array, 1 million 
elements in all, if only 100 elements are ever actually used. For such sparse two-dimensional 
arrays, it's best to implement the array as a hash of hashes: 


Sarray = { }; 


Sarray->{4}{83} = 'set'; 
Sarray->{34}{9} = 'set'; 
print Sarray->{4}{83}, "\n"; 





print Sarray->{34}{9}, "\n"; 


This prints out: 
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set 
set 


Perl creates only the table elements referenced, which makes an efficient implementation for 
a sparse matrix. However, because merely looking at a location (to see if there's anything 
there) creates an entry in the hash, you have to use the Perl exists function to keep your 
hashes sparse when looking at them. exists reports on whether a particular key (or array 
element) has been created, without actually creating it.’ So to explore the sparse matrix just 
shown, you can Say: 

'4] The function defined is related but different; when used on a hash element, it checks if the value is 
undef, not whether the value exists. 


Sarray = { }; 


"set'; 
"set'; 


Sarray->{4}{83} 
Sarray->{34}{9} 


for (my S$i2=0 7 $3 < 100 - #492) { 
for(my $7=0 7 $7 < 100 7 +499) 4 
if( exists (Sarray->{S$i}) and exists (Sarray->{Si}{$j}) ) { 
print "Array element row $i column $j is Sarray->{Si}{S$j}\n"; 


} 
} 


This reports, without increasing the size of the array, that: 


row 4 column 83 is set 
row 34 column 9 is set 


Array element 
Array element 





Question: why did you need two exists tests? (Hint: it's a two-dimensional array.) Another 
question: is Sarray a hash or a reference to an anonymous hash? Can you implement it the 
other way? See the exercises for this chapter. 
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2.4 Complex Data Structures 


Different algorithms require different data structures. Using references in Perl, it is possible to 
build very complex data structures. 


This section gives a short introduction to some of the possibilities, such as a hash with array 
values and a two-dimensional array of hashes. See the recommended reading in Section 2.9 
of this chapter for books and sections of the Perl manual that are very helpful. 


Perl uses the basic data types of scalar, array, and hash, plus the ability to declare scalar 
references to those basic data types, to build more complex structures. For instance, an 
array must have scalar elements, but those scalar elements can be references to hashes, in 
which case you have effectively created an array of hashes. 


2.4.1 Hash with Array Values 


A common example of a complex data structure is a hash with array values. Using such a 
data structure, you can associate a list of items with each keyword. The following code 

shows an example of how to build and manage such a data structure. Assume you have a set 
of human genes, and for each human gene, you want to manage an array of organisms that 
are known to have closely related genes. Of course, each such array of related organisms can 
be a different length: 


use Data: :Dumper; 





Ssrelatedgenes = ( ); 


Srelatedgenes{'stromelysin'} = [ 
"C.elegans', 
"Arabidopsis thaliana' 

l; 

Srelatedgenes{'obesity'} = [ 
"Drosophila', 
"Mus musculus' 





]; 
# Now add a new related organism to the entry for 'stromelysin' 


push( @{Srelatedgenes{'stromelysin'}}, 'Canis' ); 





print Dumper (\%relatedgenes) ; 


This program prints out the following (the very useful Data: : Dumper module is described in 
more detail later; try typing perldoc Data: : Dumper for the details of this useful way to print 
out complex data structures): 
SVAR1 = { 
"stromelysin' => [ 
"C.elegans', 
"Arabidopsis thaliana', 
"Canis' 
] , 
‘obesity' => [ 
"Drosophila', 
"Mus musculus' 
] 
}; 


The tricky part of this short program is the push. The first argument to push must be an 
array. In the program, this array is @{Srelatedgenes{'stromelysin'}}. Examining this 
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array from the inside out, you can see that it refers to the value of the hash with key 
stromelysin: $relatedgenes{'stromelysin'}. You know that the values of this 
srelatedgenes hash are references to anonymous arrays. This hash value is contained within 
a block of curly braces, which returns the reference to the anonymous array: 
{Srelatedgenes{'stromelysin'}}, and the block is preceded by an @ sign that dereferences 
the anonymous array: @{Srelatedgenes{'stromelysin'}}. 


2.4.2 Two-Dimensional Array of Hashes 


As another example, say you have data from a microarray experiment in which each location 
on a plate can be identified by an x and y location; each location is also associated with a 
particular gene and has a set of reported measurements. You can implement this particular 
data as a two-dimensional array, each entry of which is a (reference to a) hash whose keys 
are gene names and whose values are (references to) arrays of the measurements. Here's 
how you can initialize one of the entries of that two-dimensional array: 


Sarray[3][4]{'stromelysin'} = [3, 4, 5]; 


The position on the plate is represented by an entry in the two-dimensional array such as 
Sarray[3] [4]. The fact that the entry is a hash is shown by the reference to a particular key 
with {'stromelysin'}. That the value for that key is an array is shown by the assignment to 
that key Sarray[3][4]{'stromelysin'} of the anonymous array [3, 4, 5]. To print out 
the array associated with the key stromelysin, you have to remember to tell Perl that the 
value for that key is an array by surrounding the expression with curly braces preceded by an 
@ sign @{Sarray[3] [4]{'stromelysin'}}: 


Sarray[3][4]{'stromelysin'} = [3, 4, 5]; 

print "The scores for plate position 3, 4 were @{Sarray[3] [4]{'stromelysin' }} 
\n"} 

This prints: 


The scores for plate position 3, 4 were 3 4 5 


A common Perl trick is to dereference a complex data structure by enclosing the whole thing 
in curly braces and preceding it with the correct symbol: $, @, or %. So, take a moment and 
reread the last example. Do you see how the following: 


Sarray[3][4]{'stromelysin'} 
is the key for a hash? Do you see how the phrase: 
@{Sarray[3] [4] {'stromelysin'}} 
makes it clear that the value for that hash key is an array? Similarly, if the value for that hash 
key was a Scalar, you could say: 
S{Sarray[3][4]{'stromelysin'}} 
and if the value for that hash key was a hash, you could say: 
${Sarray[3][4]{'stromelysin'}} 


2.4.3 Complex Data Structures 


References give you a fair amount of flexibility. For example, your data structures can 
combine references to different types of data. You can have an anonymous array such as in 
the following short program: 
Sgene = [ 

# hash of basic information about the gene name, discoverer, 

# discovery date and laboratory. 


{ 


name => 'antiaging', 
reference => [ 'G. Mendel', '1865'], 
laboratory => [ 'Dept. of Genetics', 'Cornell University', 'USA'] 


}, 
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# scalar giving priority 
"high", 


# array of local work history 
['dJim', 'Rose', 'Eamon', 'Joe'] 





]; 
print "Name is ", ${$gene->[0]}{'name'}, "\n"; 
print "Priority is ", $gene->[1], "\n"; 


print "Research center is ", ${${$gene->[0]}{'laboratory'}}[1], "\n"; 





print "These individuals worked on the gene: ", "@{Sgene->[2]}", "\n"; 


This program produces the output: 

Name is antiaging 

Priority is high 

Research center is Cornell University 

These individuals worked on the gene: Jim Rose Eamon Joe 





Let's examine this code to understand how it works; it contains most of the points made in 
this chapter. 


Sgene is a pointer to an anonymous array of three elements. Therefore each element of 
Sgene is referred to by either: 


SSgene[0] 
SSgene[1] 
SSgene[2] 
or equivalently (and our choice in this code) by: 


Sgene->[0] 
Sgene->[1] 
Sgene->[2] 


To be specific, the first element is a reference to an anonymous hash, the second element is 
a scalar string high, and the third element is a reference to an anonymous workgroup array. 


The plot thickens when you examine the anonymous hash that is referenced by the first array 
element. It has three keys, one of which, name, has a simple scalar value. The other two keys 
have values that are references to anonymous arrays of scalar strings. 


So, this certainly qualifies as a complex data structure! 


When you place any of the elements of the $gene anonymous array within a block of curly 
braces, you have a reference that must be dereferenced appropriately. To refer to the entire 
hash at the beginning of the array, say: 


S{$gene->[0]} 

As done with the program code, the scalar value that is the second element of the array is 
accessed simply as: 

Sgene->[1] 

The third part of this data structure is an anonymous array, which we can refer to in total as: 
@{$gene->[2] } 

This is also done in the program code. 


Now, let's finish by looking into the first element of the $gene anonymous array. This is a 
reference to an anonymous hash. One of the keys of that hash has a simple scalar string 
value, which is referenced with: 


${Sgene->[0]} {name} 
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as was done in the program code. To make sure we understand this, let's write it out: 
${Sgene->[0]} {name} 

is 
S hashref {name } 

is 
‘antiaging' 
{Sgene->[0]} is a block containing a reference to an anonymous hash. It is then used as is 
typical for a hash reference: it's preceded by a $ and followed by the key name in curly braces 
and so resolves to a lookup of the key name in the anonymous hash. 


The most intricate dereference in this program is that which digs out the name of the 
research center: 


${${Sgene->[0]} {laboratory} }[1] 


is 

S{S hashref {laboratory} }[1] 
is 

§ arrayref [1] 
is 


"Cornell University' 


Here, the {Sgene->[0]} is a reference to an anonymous hash. The value for the key 
laboratory is retrieved from that anonymous hash; the value is an anonymous array. 
Finally, that anonymous array ${$gene->[0]} {laboratory} is enclosed in a block of curly 
braces, preceded by a $, and followed by an array index 1 in square brackets, which 
dereferences the anonymous array and returns the second element Cornell University. 


Note that the last expression can also be written as: 
Sgene->[0]->{laboratory}->[1] 
You see how the use of references within blocks enables you to dereference some rather 


deep-nested data structures. I urge you to take the time to understand this example and to 
use the resources listed in Section 2.9. 
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2.5 Printing Complex Data Structures 


Sometimes you need to look inside your complex data structures to see what the settings 
are. One of the most useful ways to examine a data structure is by means of the 
Data: :Dumper module. This module comes standard with all recent versions of Perl. 


Here is the summary and part of the synopsis and description as output from the perldoc 
Data: :Dumper command: 


NAME 





Data::Dumper - stringified perl data structures, suitable 
for both printing and "eval" 


SYNOPSIS 
use Data::Dumper; 


# simple procedural interface 
print Dumper(Sfoo, Sbar); 


DESCRIPTION 
Given a list of scalars or reference variables, writes out 
their contents in perl syntax. The references can also be 
objects. The contents of each variable is output ina 





Single Perl statement. Handles self-referential strucTures correctly. 





The return value can be "eval"ed to get back an identical 
copy of the original reference structure. 


(pared 
This output of a two-dimensional array illustrates its use: 


use Data: :Dumper; 
Sarray = [ ]; 


# Initialize the array 
for ($i=0; Si < 4 3 +462) { 
for ($j=0; $3 < 4 ; ++83) { 
Sarray->[S$i] [$j] = $i * $j; 
} 
} 


# Print the array "by hand" 
for($i=0; Si < 4 ; ++Si) { 
for(e7—G7 27 = 4 7 rea) 4 
printf£("$3d ", Sarray->[$i][$j]); 
} 
print “\n"'? 


} 


# Print the array using Data::Dumper 
print Dumper (Sarray) ; 


This produces the output: 


0 0 0 0 

0 1 2 3 

0 2 4 6 

0 3 6 g 
SVARI1 = [ 
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You can make a nicer display by knowing exactly what the data is and in what form to write it 
out. Data: :Dumper can also display the data in a fairly readable format (and there are several 
options as to how the data is displayed). In addition, Data: :Dumper allows you to dump a 
data structure out to a file and then read it in to another program. See the perldoc 

Data: :Dumper manpage for more details. 


You can also print out an array of arrays @array by printing each row one at a time. 
Remember that each row is an anonymous array, so each entry of the @array array is a 
reference to an anonymous array: 


@array = ( 
[Oy Os. By “Oly 
[OG Ly Ze oily 
LO). 2y Ay Ole 
LOG. Se- 95-- 2] 


Me 


for Sanon (@array) { 
print "@Sanon\n"; 
} 
See the Perl perllol reference page for more information on initializing and printing arrays of 
arrays. 
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2.6 Data Structures in Action 


The previous sections introduced a fair amount of new Perl syntax and capabilities. Now, let's 
see some of these new capabilities in action. 


2.6.1 The Problem of String Matching 


It is frequently important in biology to find the best possible match for a short sequence in a 
longer sequence; for example, between an oligonucleotide and the sequence of DNA that has 
been cloned in a YAC or BAC. This match need not always be perfect; frequently, what is 
important is to find the closest match available. This problem is known in computer science as 
approximate string matching, and dynamic programming is a popular technique used to 
compute the solution. 


The problem of string matching is to find a pattern, such as a nucleotide or peptide fragment, 
in a longer text such as a chromosome or protein. 


The problem of approximate string matching is to find a pattern in a text in which the match 
might not be perfect. Perhaps a few of the characters are different or missing; the problem is 
to find the best match possible. 


2.6.2 Genetic Variability and String Matching 


Biologically, approximate matches are of commanding importance. Evolutionary changes 
between species can make genes with essentially the same function collect a fair number of 
individual base changes; they may even have acquired differences in exon structure. Even 
within a species, individual base changes among groups in the population (single nucleotide 
polymorphisms) are important causes of disease and important clues in the discovery of 
disease-causing genes. 


Mutations tend to accumulate over time in noncoding regions of DNA; mutations in coding 
regions tend to avoid altering critical regions essential for the functioning of the gene (where 
mutations may be fatal to the organism). Even a noncoding region may be critical to the 
regulation of a gene and thus tend to resist mutations. As a result, studying where mutations 
are not accumulating is often an important clue to discerning the function and control of a 
gene and its associated protein. 


Due to the redundancy of the genetic code, mutations in DNA may not affect the coding of 
the associated protein. Other mutations may make a change in a coded amino acid, but it 
may be an amino acid with similar properties to the original amino acid; for instance, both 
amino acids may be hydrophilic. Tracing these kinds of mutations is another source of 
important information about the process of mutation. It can give vital clues to the 
conservation of critical coding regions and to the folding of proteins. 


Biologists will have no difficulty expanding upon the preceding brief motivation for studying 
approximate string matching and other techniques that find similarities between biological 
molecules. Many standard laboratory techniques rely on the annealing of a molecule to 
another molecule with similar, but not necessarily exact, structure. 


In the discussion that follows, you'll use your new knowledge of complex data structures in 
Perl to implement an algorithm that finds approximate matches of patterns in text, such as 
DNA fragments in chromosomes or peptide fragments in proteins. 


The algorithm I present is an example of dynamic programming; it relies on computing the 
entries to a matrix. 
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2./ Dynamic Programming 


Dynamic programming computes the values for small subproblems and stores those values in 
a matrix. The stored values are then used to solve larger subproblems (without incurring the 
cost of recomputing the smaller subproblems) and so on until the solution to the overall 
problem is found. The term "dynamic programming" is a bit of a misnomer since it doesn't 
involve changing values over time as the word "dynamic" suggests. 


This approach relies on having a data structure available to store the intermediate values as 
the algorithm proceeds. The data structure may require a fair amount of computer memory, 
but the overall speed of the algorithm often makes the memory cost worthwhile. In this 
section, we'll use a Perl multidimensional array, namely a simple two-dimensional matrix, to 
solve an approximate string matching problem. 


Our algorithm will find a (shorter) pattern in a (longer) text. We'll start with a two-dimensional 
array, or matrix. The columns of the matrix will be associated with the (shorter) pattern, and 
the rows of the matrix will be associated with the (longer) text. The zeroth row and the 
zeroth column will be initialized to the appropriate starting values. We'll then calculate each 
value in the matrix by examining adjacent, already calculated values in conjunction with the 
characters of the pattern and the text. After the entire matrix has been filled in, we'll have the 
answer to our problem. That is, we'll find the position(s) in the text that most closely match 
the pattern, and we'll do so by simply examining the values in the last row of the matrix. 
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2.8 Approximate String Matching 


You've most likely learned how to use regular expressions to find any of a set of possible 
patterns in a string. Approximate string matching is similar: an approximate string matching 
algorithm finds any of a set of possible patterns in a string. However, the two approaches are 
quite different in their capabilities and their ease of use. Simply stated, approximate string 
matching can find many close matches to a pattern that would be very tedious to specify 
using regular expressions. 


2.8.1 Edit Distance 


There are several ways to measure the distance between two strings, and our algorithm will 
use one such measure. Some variants of this measure are considered in the exercises at the 
end of the chapter. 


Our algorithm uses the idea of edit distance to measure the similarity between two strings. 
The idea is quite simple. Assume that there are three things you can do to alter a string: 
Substitution 

Change any character to a different character 
Deletion 

Delete any character 
Insertion 

Insert a new character at any position 
Now, let's say that every time you make any of these three edits, you incur an edit cost of 1. 


Now, call the edit distance between two strings as the minimum edit cost needed to change 
one string into the other. 


For instance, let's say there are two strings portend and profound. You can apply the 
following edits to portend: 


portend 

(delete o) 
prtend 

(insert o) 
protend 

(change t to f) 
profend 

(change e to 0) 
profond 

(insert wu) 
profound 


You can see that five edits were applied. Assuming you can't find a quicker way to change one 
string into the other, the edit distance between the two strings is 5. 


Clearly, you can also start from the other string and apply the same edits in reverse (just 
interchanging the deletions and insertions, and reversing the substitutions). So, starting from 
profound, delete u, change o to e, change f to t, delete o, and finally insert 0, to arrive at 
portend. 


The relevance of this concept of edit distance to sequence alignment is simply: the smaller the 
edit distance, the better the alignment. 


We'll ignore for the moment whether these are the most biologically relevant types of edits. 
However, you may want to start thinking of "edits" as "mutations," and consider whether the 
problem is perhaps formulated too simplistically to model the actual process of mutation. 
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What about the differences between DNA and proteins in this regard? Should we permit 
switching the order of two adjacent characters as an edit? How about reversing a substring of 
the string, as in the known biological process of inversion? Does our assignment of an equal 
cost to each "edit" make biological sense? Keep these questions in mind; we may eventually 
want to improve the algorithm by incorporating some of the modifications suggested by 
these ideas. 


In approximate string matching, the problem is to find a (short) pattern in a (usually much 
longer) text. Of the many possible variants of this problem, let's find the location(s) in the 
text with the smallest possible edit distance from the pattern. We'll do so in the next section. 


2.8.1.1 A string matching program 


Now that you've got an idea of the nature of the problem, its importance to biology, and the 
basic definitions that we'll use in approaching the problem, let's take a look at a Perl program 
that solves our problem using dynamic programming. This program uses the references and 
multidimensional arrays introduced earlier in this chapter. 


As usual, my approach is to proceed as quickly as possible to practical matters, not dwelling 
on the theoretical aspects of the computer science involved. I urge the interested reader to 
consult the references at the end of the chapter for more satisfying details on the techniques 
of dynamic programming and on the many different approaches that have been developed to 
solve the approximate string matching problem. 


The following is a Perl program for approximate string matching, followed by a somewhat 
detailed discussion of how it works. The program proceeds as follows: first, the pattern and 
text are defined (as peptides in this case). Then the distance matrix, a two-dimensional array, 
is declared. After initializing the first row and the first column to the appropriate values (to be 
discussed later), the algorithm proceeds to fill in the matrix. Each matrix location is calculated 
by examining three adjacent locations, plus the amino acids in the pattern and the text at that 
spot; the resulting matrix is printed out. Finally, the best match is found by looking for the 
smallest edit distance entered in the last row of the matrix. The final result is printed out. 


#!/usr/bin/perl 


Approximate string matching using dynamic programming 


Find the closest match to the pattern in the text 


SHH Heo HEHE 


se strict; 
se warnings; 


(Sen 





my Spattern = 'EIQADEVRL'; 
print "PATTERN: \nSpattern\n"; 




















my Stext = 'SVLODRSMPHQEILAADEVLOQESEMRQQDMI SHDE'; 
print "TEXT:\nStext\n"; 




















m 
m 











TLEN length $text; 
PLEN = length Spattern; 





y $ 
y $ 








D is the Distance matrix, which shows the "edit distance" between 
substrings of the pattern and the text. 

It is implemented as a reference to an anonymous array. 

my SD=[ ]; 





The rows correspond to the text 
Initialize row 0 of D. 

for (my $t=0; St <= STLEN ; ++St) { 
SD=> Lee) LO = Oy 
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} 


# The columns correspond to the pattern 

# Initialize column 0 of D. 

for (my S$p=0; Sp <= SPLEN ; ++Sp) { 
$D->[0] [Sp] = Sp; 





} 


# Compute the edit distances. 
for (my $t=1; St <= STLEN ; ++St) { 
for (my $p=1; Sp <= SPLEN ; ++$p) { 








$D->[St][$p] = 


# Choose whichever of the three alternatives has the least cost 
min3 ( 

# First alternative 

# The text and pattern may or may not match at this character 

substr(Stext, $t-1, 1) eq substr(S$pattern, S$p-1, 1) 

? S$D->[S$t-1][Sp-l] # If they match, no increase in edit 

distance! 
eD=>[St-1] [Sp=1] 1, 


# Second alternative 
# If the text is missing a character 
$D->[$t-1] [$p] + 1, 


# Third alternative 
# If the pattern is missing a character 
$D->[$t][$p-1] + 1 


} 


# Print D, the resulting edit distance array 
for (my S$p=0; Sp <= SPLEN ; ++Sp) { 
for (my S$t=0; St <= STLEN ; ++S$t) { 
print $D->[$t][$p], ""; 











} 
prant Am"? 


} 


my @matches = ( ); 
my Sbestscore = 10000000; 


# Find the best match(es). 
# The edit distances appear in the the last row. 
for (my $t=1 ; St <= STLEN ; ++St) { 
if( $D->[$t][$PLEN] < Sbestscore) { 
Sbestscore = $D->[St] [SPLEN]; 
@matches = (St); 
}elsif( SD->[$t] [SPLEN] = = Sbestscore) { 
push(@matches, St) 




















} 
} 


# Report the best match(es). 

print "\nThe best match for the pattern Spattern\n"; 
print "has an edit distance of Sbestscore\n"; 

print "and appears in the text ending at location"; 
print "s" if ( @matches > 1); 

print " @matches\n"; 
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sub mins 4 
my($i, $3, $k) = @; 
my (Stmp) ; 


Stmp: = (Sa < Si) FSi 2 Saye 
Stmp < Sk ? Stmp : $k; 
} 


Here is the output: 









































PATTERN: 
EIQADEVRL 

TEXT: 

SVLQODRSMPHQEI LAADEVLQESEMRQQDMISHDE 

OOO 00 0. 0 0 0-00 0 0 0.0 0 0 O00 000000 0.0 0 00 0 0 0 0 0 0 
LA PALEPLAREDPLLOP TLE ero PLLA Liha 2a 
SSSSRPPSPSeSSesVGOLeeesitisesiiptrisee227iesei 
SSR SF SS Sse artetiLstes2 tee eee ee eeaeseFe e332 
44.4.4 3344 4 4:4 3:3 221223323 33 3:5 3 223. 3°3 4 3 3 2 2.3 
S555 434555544332 223444 4444 44 A ZA AA ABA 
6666544566654 44333234545455554405 5 5 4 3 
YY 6 76 Sb S S67 76 SS Sb 44482 Seo SS S66 6 S'S S 6 6 54 
oo 7 FY 7 6 S-6 6 7 8 7 6 6-6 55 5 4 3°34. 5 6 6 6 5.6 7 G6 6-6 7°65 
Oe Oe OF BO Ge 16. OT OP BBO OT GS 6. GB 6 Se AS AS GT PG GE OT 














The best match for the pattern EIQADEVRL 
has an edit distance of 3 
and appears in the text ending at location 20 


2.8.1.2 Analysis 


Let's examine this program in detail to see how the Perl references and matrices are used and 
to learn how dynamic programming can solve the approximate string matching problem. 


You can see that I specified a short pattern and a fairly short text. The shortness of the text is 
purely for didactic purposes: I want to show the entire matrix that is computed. To see how it 
performs, you should try running the program by reading in, say, human chromosome 1, and 
searching for a 20 basepair oligonucleotide. (You'll probably want to omit the printing of the 
matrix in this case!) 


Because our program needs to refer to the lengths of the pattern and text at several points, 
we precompute them to save time and to make the code easier to read. ($PLEN is easier to 
read in those tightly packed loops than length Spattern.) 





Next, the array D is declared: 
my SD=— [ IF 


$D refers to the anonymous array denoted by [ ]. It is populated as the algorithm 
progresses, as a two-dimensional array of integers. 


The code for the calculation of the edit distance array $D is very short. The Oth row and the 
Oth column are initialized to start off the computation. The Oth row is initialized to all Os, and 
the Oth column is initialized to 0, 1, 2, ... up to PLEN, 





Why? Let's talk about how this algorithm is going to work. 








Each entry of the TLEN+1 x PLEN+1 matrix contains the edit distance between a prefix of the 
pattern and a substring of the text. To be more precise: consider the entry of $D at row $t 
and column $p (we'll call it in actual Perl syntax $D->[S$t] [Sp]). The value of $D at this 
position represents the edit distance between the prefix of the pattern of length $p, anda 
substring of the text ending at text position $t. 


Let's look at an actual example. Figure 2-1 shows the output of the edit distance matrix 
again, with the input strings lined up with the rows and columns. 
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Figure 2-1. Edit distance matrix 


SVLODRSMPHOETLAADEVLOESEMROODMISHDE 
CNODDDDDD OOD DOO ODO OND GOO0NDD00ND0000000 
E111111111111011111011101011111111110 
1222222222222101222112211112222212221 
0333323333332211233222222222223322332 
A444433444443322123333333333333433343 
D555543455554433222344444444443444434 
£666654456665444333234545455554455543 
V776765556776555444323455556665556654 
R887776566787666555433456665676666765 
L998787667788776666543456776677777776 


The entry at row 4 and column 15, that is, $D->[4] [15], highlighted in the figure, has the 
value 1. This means that the first four characters of the pattern, EIQA, are one edit away 
from a substring that ends at position 15 in the text. Let's check this. The prefix of length 4 of 
the pattern is ETQA. A substring that ends at position 15 in the text is EILA. (Note that the 
beginning position of this substring isn't clear; more on that later and in the exercises.) You 
see that one substitution, of L in the text to Q in the pattern, transforms the one string to the 
other. By this definition, the edit distance between these two strings is 1. That's what the 
entry in the matrix says, so that entry is verified. 











Take a minute to verify another entry or two. 


Because you're looking for a match for the entire pattern, it follows that you merely need to 
look at the last row of the edit distance matrix $D. Each position in that row shows the edit 
distance between the entire pattern and a substring of the text that ends at that position. 


Take a look at two or three positions in the last row, and satisfy yourself that the value at 
each of those positions corresponds to the edit distance between the pattern and a substring 
of the text ending at that position. 


Finally, because you're looking for the best match for the pattern in the text, check the 
position or positions in the last row of the $D matrix that have the minimum value. If there is 
an exact match for the pattern in the text, there will be a 0 in some position in the last row of 
the $D matrix. 


Make sure you also understand the size of the $D matrix and how the positions correspond to 
positions in the pattern and text strings. 


The border conditions of the computation are given by the Oth row and Oth column. (Border 
conditions are just the starting values of a computation that are initialized before the 
computation begins.) If you're unfamiliar with border conditions, the initialization may seem a 
little strange, but it is absolutely typical of many algorithms that they have seemingly trivial 
border conditions. 





Each position in any row, from 1 to TLEN, corresponds to a character of the text, starting 
from the first character to the last. The Oth row is initialized to all Os because there is an edit 
distance of 0 from the prefix of the pattern of length O to any position in the text. In other 
words, the empty string (a pattern of length 0) can be matched at any position of the text, 
so the edit distance is zero all along the Oth row. 





Similarly, the Oth column is initialized to 0, 1, ..., up to PLEN in the last position of the column. 
These values indicate that at each position in the Oth column, say, at position Sp, 
corresponding to the prefix of the pattern of length $p, it takes $p edits (in particular, $p 
insertions) to match the substring of the text ending at position O (that is, a substring of 
length 0). 


The matrix is now set up, and the Oth column and Oth row are initialized; let's see how the 
algorithm proceeds. 


Figure 2-2 shows the key idea. The value of a position in the matrix, say: 
$D->[$t] [$p] 
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can be computed by checking the values of the three adjacent positions: 


$D->[$t-1] [Sp] 
$D->[$t] [$p-1] 
SD=> [90-1] [Sp=1] 


and considering whether or not the character at position t of the text is the same as the 
character at position $p of the pattern. 


Figure 2-2. Computing value for row p and column t 





Let's see how this works. The desired, uncomputed matrix entry at $D->[$t] [$p], is 
represented as ???. The previously computed values at the three adjacent matrix positions 
are represented by the letters A, B, and C. 


The distance of the match of the pattern prefix of length $p-1 ends at the text at position 
$t-1, represented in the figure as the integer A. If the next character in the pattern, at 
position Sp, is the same as the next character in the text, at position $t, the edit distance of 
this longer pattern, which ends at this next position in the text, is still A (because no additional 
substitution, deletion, or insertion was required). 


On the other hand, if the characters are different, you'd get a new value of A+1 for this 
match of a longer pattern with a later position in the text. 


In the code, this first value, which may be one of two different alternatives depending on the 
characters in the text and the pattern, is programmed using the Perl conditional operator 
TEST ? Expressionl : Expression2, This operator returns the value of the first expression if 
the test is true or the value of the second expression if the test is false. Here, the TEST is: 


substr ($text, S$t-1, 1) eq substr(Spattern, S$p-1, 1) 




















the first expression is $D->[$t-1] [S$p-1], and the second expression is $D->[$t-1] [$p-1] + 
1. 


# First alternative 

# The text and pattern may or may not match at this character ... 

substr ($text, S$t-1, 1) eq substr(Spattern, $p-1, 1) 

2? SD->[St-1][Sp-1] # If they match, no increase in edit distance! 
$D->[$t-1] [$p-1] + 1, 


Now, consider the case of the match at the position $D->[$t] [$p-1] with value B. This 
represents the match of the pattern prefix of length $p-1 ending at the text at position $t. If 
you insert another character of the pattern to match at the same position in the text, that 
incurs an additional edit distance for the insertion, so the new value is B+1. The code for this 
value is: 





# Second alternative 
# If the text is missing a character 
$D->[St-1] [$p] + 1, 


Similarly, consider the case of the match at the position $D->[$t-1] [$p] with value c. This 
represents the match of the pattern prefix of length $p ending at the text at position $t-1. If 
you extend the match to another character of the text to match at the same position in the 
pattern, that incurs an additional edit distance for the insertion, so the new value is C+1. The 
following is the code for this value. 

# Third alternative 

# If the pattern is missing a character 

PD=>(St) [Spa L), ae 


These three alternatives are the three arguments to the min3 subroutine, which returns the 
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minimum from among three values. 


When considering edit distances, you can start from either string. Inserting a character in one 
string is the same as deleting a character from the other string, depending on which string 
you start with. 


I assert that this procedure covers all possible ways of extending a match. This assertion can 
be proved, but in the interests of practicality, we'll skip that essential step of demonstrating 
program correctness. The interested reader may want to think about how to prove such an 
assertion or may simply refer to the literature in Section 2.9. 


So, what value do you put in place of the ??? placeholder? Since we're looking for the 
smallest edit distance, let's choose the minimum of the values that have been examined. If 
character $t and character Sp are the same, you'd choose the minimum of the values A, 
B+1, and C+1. On the other hand, if character $t and character $p are different, you'd 
choose the minimum of the values A+1, B+1, and C+1. 


That's our algorithm. Take a moment to examine the Perl code, reproduced here, that 
calculates the matrix, and see how the code reflects our discussion of the algorithm: 
# Compute the edit distances. 
for (my $t=1; St <= STLEN ; ++St) { 
for (my S$p=1; Sp <= SPLEN ; ++$p) { 








$D->[St][$p] = 


# Choose whichever of the three alternatives has the least cost 
min3 ( 
# First alternative 
# The text and pattern may or may not match at this character 
substr(Stext, $t-1, 1) eq substr(S$pattern, $p-1, 1) 
? S$D->[S$t-1][Sp-l] # If they match, no increase in edit 


distance! 
SD=>[St=1] [Sp=1] + A, 
# Second alternative 
# If the text is missing a character 
$D->[$t-1] [$p] + 1, 
# Third alternative 
# If the pattern is missing a character 
SD-> [St] [Sp=2]. + 1 
) 
} 
} 
[ Team LiB ] 
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2.9 Resources 


I recommend the following O'Reilly sources for more details on data structures in Perl: 


Here's 


Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant. This is the bible 
of Perl programming. Everything about data structures is explained in detail. 


Advanced Perl Programming by Sriram Srinivasan. An excellent book that covers 
references and data structures. 


Mastering Algorithms with Perl by Jon Orwant, Jarkko Hietaniemi, and John 
Macdonald. A marvelous book, especially if you like this chapter. Many interesting data 
structures and algorithms are explained and implemented in Perl. 


The Perl Cookbook by Tom Christiansen and Nathan Torkington. As the title implies, 
this book is composed of fairly short recipes that accomplish particular tasks, grouped 
according to application area. 


where to go for Perl documentation: 


The perlreftut tutorial page from the Perl documentation gives a short introduction 
to Perl references (type perldoc perlreftut at your command line if Perl is installed, 
or visit the web page http://www.perldoc.com). 





The perlref tutorial page from the Perl documentation discusses Perl references in 
detail. 


The perldata tutorial page from the Perl documentation gives an introduction to Perl 
data structures. 


The perldsc tutorial page from the Perl documentation presents a "cookbook" 
overview of Perl data structures. 


The perllol tutorial page from the Perl documentation gives an introduction to arrays 
of arrays. 


The literature on algorithms is vast, including many textbooks as well as advanced 
monographs and peer reviewed journals. I recommend the following sources as a few entry 
points for more details on dynamic programming and algorithms: 


Computer Algorithms by Sara Baase and Allen VanGelder (Addison Wesley). This book 
is clearly written for the undergraduate level, and includes a very nice explanation of 
the algorithm presented in this chapter. 


Introduction to Algorithms by Thomas Cormen, Charles Leiserson, and Ron Rivest 
(MIT Press and McGraw-Hill). An excellent and standard college textbook, well written; 
can serve as a reference book for the serious programmer. 


Algorithms on Strings, Trees, and Sequences: Computer Science and Computational 
Biology by Dan Gusfield (Cambridge University Press). This book specializes in 
algorithms for strings, including such topics as sequence alignment. It's very detailed, 
but even so, not completethis is a big field! The best single source on string algorithms, 
with lots of information about biological sequence similarity. 


Data Structures and Algorithms by Alfred Aho, John E. Hopcroft, and Jeffrey Ullman 
(Addison Wesley). This is the classic book on the science of algorithms. 


Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes by 
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Frank Thomson Leighton (Morgan Kaufmann). A comprehensive, rigorous text and 
reference book. 


e Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan (Cambridge 
University Press). A clear, rigorous course. 
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2.10 Exercises 


Exercise 2.1 


Suggest a programming situation in which it would make sense to have several scalar 
references to one scalar variable that contains a peptide fragment. 


Exercise 2.2 

When might you want to use a reference to an (anonymous) scalar constant? 
Exercise 2.3 

Is $Sarr[0] the same as Sarr->[0]? Why or why not? 
Exercise 2.4 


Write a subroutine that returns a reference to a hash. Declare a reference to this 
subroutine and call it using the reference, then print out the hash whose reference is 
returned from the subroutine. 


Exercise 2.5 


Write a subroutine that returns a new anonymous subroutine based on its arguments, 
which are passed to it as references. Call the subroutine and then call the new 
subroutine that is returned. 


Exercise 2.6 
Write a subroutine to multiply two matrices. 
Exercise 2.7 


Develop a data structure that is a hash at the top level and can be used to record the 
data from microarray runs. 


Exercise 2.8 


Write a min subroutine that returns the minimum of two integers. Rewrite min3 using 
it. 


Exercise 2.9 


Make a subroutine that prints the distance matrix. Make it handle the display of longer 
numbers appropriately. 


Exercise 2.10 


Make the subroutine from Exercise 2.9 also display the pattern and string aligned with 
the output of the edit distance matrix. 


Exercise 2.11 


Make a subroutine that returns the edit distance array, the best score, and the 
locations of the best matches. 


Exercise 2.12 


Any difference in Exercise 2.11 if you calculate row by row instead of column by 
column? 


Exercise 2.13 

In Exercise 2.11, save space by keeping only two rows in memory. 
Exercise 2.14 

In Exercise 2.11, report on types of edits of matches. 
Exercise 2.15 


In Exercise 2.14, show the strings aligned, extra spaces for insertions or deletions, and 
flag mismatches. 


Exercise 2.16 
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In Exercise 2.11, can you speed the program up by skipping computations when it 
becomes clear that the best score you've already found can't be matched or bettered 
in that column? Why or why not? 


Exercise 2.17 


Run the approximate string matching program on large text, such as entire 
chromosomes. How is the performance? 


Exercise 2.18 


Make a version of the approximate string matching program that searches for the 
pattern in libraries of text, and then reports on, say, the top 25 hits. (For example, 
search for a DNA pattern in all the records of a GenBank library.) 


Exercise 2.19 


How can you alter the approximate string matching algorithm to give different weights 
to the three edit possibilities, or to different types of mismatches (e.g., by giving a 
fairly low weight to mismatches between hydrophobic amino acids or the codons that 
encode them)? 


Exercise 2.20 
Write an array that handles eleven dimensions. 
Exercise 2.21 


Answer the questions at the end of Section 2.3.3: Why do you need two exists 
tests? Is Sarray a hash or a reference to an anonymous hash? Can you implement it 
the other way? 
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Chapter 3. Object-Oriented Programming 
in Perl 


In Chapter 1, you saw how modules are defined and used, and in Chapter 2, how references 
and data structures work. Now, it's time to introduce the important concepts and techniques 
of object-oriented programming in Perl that are based on modules and references. 


Object-oriented (OO) programming is one of the most important approaches to writing 
programs, and it is an approach that has been well supported by Perl for quite a while. Other 
OO languages of interest include Java, C++, and Smalltalk. Many Perl modules are written in 
an OO style, and their proper use requires some fundamental understanding of the OO 
approach. Luckily, the key concepts are fairly simple. 


Perl easily supports both declarative and OO programming. (Perl was originally a declarative 
language only; the OO style was added fairly early on.) Declarative programming is 
characterized by code that declares variables and subroutines, conditional tests, if-else 
branches, and loops, and various arithmetic, logical, and string operators. It is up to you to 
manage the definition and use of the variables and subroutines so that they interact in 
appropriate ways. (You'll see shortly how object-oriented programming imposes additional 
constraints that help you create well-behaved programs.) Many declarative programming 
languages are well established, including Perl and such stalwarts as C, FORTRAN, and BASIC, 
to name just a few. By this point, assuming you have some experience programming in Perl, 
you should be fairly comfortable with the declarative style. 


The first part of this chapter is an overview of OO programming and how OO Perl modules are 
used. If you're a beginning Perl programmer, you'll find them easy to use because they rarely 
require you to know how to write OO Perl code. Depending on your needs and goals, this 
might be all the information you'll require from this chapter. 


As a more advanced programmer, you'll sometimes need to write your own OO 
bioinformatics software. If you're such a programmer, the second part of this chapter will be 
of greatest interest to you. However, because the material is developed incrementally, you 
will most likely want to read the chapter in order from beginning to end. 


Perl makes clever and simple use of existing mechanisms to support OO programming. Perl 
packages and modules are used to define OO classes, Perl references define OO objects, and 
Perl subroutines define OO methods. The definitions of these terms will become clear as you 
read the chapter, but in brief, OO software is organized into classes that contain data called 
objects. Subroutines called methods operate on the objects. 


Over the course of this chapter, I'll develop a small example object module, Gene.pm, to 
demonstrate the essentials of OO Perl. Gene.pm is developed in four stages so you can learn 
the OO style gradually. The final code for Gene.pm serves as a template from which you can 
begin developing your own OO software. 
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3.1 What Is Object-Oriented Programming? 


Object-oriented programming is a way to organize code so it interacts in certain prescribed 
ways, obeying certain rules about how the data and subroutines are organized. In other 
words, it imposes a certain programming discipline that can lead to better and more reliable 
code. 


The key idea of OO programming is that all data is stored and modified with special data 
structures called objects, and each kind of object can be accessed only by its defined 
subroutines called methods. The user of an OO class is typically spared the effort of directly 
manipulating data, and can use class methods for this instead. 


The promise of this OO structure of program code is that it makes the resulting programs 
cleanly designed, more reliable, easier to reuse in other programs, and easier to modify and 
improve. In essence, the approach imposes certain restrictions on what a programmer can 
do with the data and subroutines at hand. 


Proponents of the OO approach cite the benefits this extra discipline provides. It is certainly 
true that you can follow good programming practices without using an OO approach. 
However, OO does provide a well-defined framework for encouraging discipline and good 
programming practices. In a very flexible language such as Perl, good practices can 
sometimes be easier to enforce in the framework of OO. We'll see how this comes about in 
the examples that follow. 


3.1.1 Why Object-Oriented Programming? 


It is often important and necessary to weigh the costs and benefits of a given system against 
the alternatives in an applied engineering discipline such as programming. The decision to use 
OO programming, declarative programming, or some other paradigm, is often subject to 
religious debates, with some enthusiasts promoting their favorite approach against all 

comers. This is especially relevant to the Perl programmer, because Perl allows you to write in 
the declarative or in the OO style. You should know that OO programming isn't always the 
correct choice for a programming project. Despite the real benefits it can confer upon a 
software development project, it can also have certain costs; these costs and benefits should 
be weighed against each other. 


For instance, some types of software lend themselves more readily to abstracting with OO 
techniques than others. Object-oriented software development can sometimes take longer 
due to the overhead associated with its level of abstraction. OO software sometimes runs 
slower than other approaches; this has certainly been true in Perl, and although not usually a 
deal breaker, it is sometimes an important consideration. (Current work on the upcoming Perl 
6 is addressing this performance issue.) 


In spite of these strictures, OO programming is often an excellent choice; it has become a 
key approach to writing software in the Perl language. 


3.1.2 Terminology 
Object 
An object is a collection of data that logically belongs together. 


For instance, you might have a genome object that would have such attributes (or 
parts) as the name of the organism, the DNA sequence data, the start and end points 
for each exon, the genes with their associated lists of exons, and so forth. The exact 
nature of an object is a matter of logic and convenience, and in the end, it depends on 
the judgment of the programmer as to what collection of attributes makes an object 
that will accomplish the goals at hand. 
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A standard term in computer science for a collection of data is a data structure. Data 
structures are often studied in terms of accomplishing certain a/gorithms. It is often 
the case that the fastest, cleverest way to compute something relies upon putting the 
data into a special data structure that makes a fast algorithm possible.™ 


[1] This point really highlights the importance of spending some time on design at the beginning of 
your programming, and the importance of using a good efficiency tester such as Benchmark.pm to 
evaluate alternate solutions. 





In Perl, objects are usually implemented as references to hashes and are marked with 
their class name. The Perl function bless marks a reference with a class name, as will 
be explained in this chapter. The Perl function ref can be used to see what class name 
an object is marked with. 


Method 


Class 


In OO programming, a subroutine called a method is associated with an object. Each 
type of object has one or more methods that it can call. In this style of programming, 
the only way to access the data in an object is by the defined methods of the class. 
This restriction is meant to increase reliability, reusability, and maintainability. 


In Perl, methods are implemented as Perl subroutines, although they behave slightly 
differently from other subroutines. Methods are called from an object, and they know 
what object they're called on; they don't need to be explicitly imported, and they exist 
in a hierarchy of classes. If you know how to use Perl subroutines, methods are an 
easy next step. 


Together, the object definitions and the collection of methods for them defines a class. 
A specific object (say, a genome object for C. elegans) is called an instance of a class. 


Classes can be related to each other so methods can be inherited into a class from 
another class. In Perl, classes are implemented as namespaces, by means of the 
package directive. 


Now that you're familiar with the language of OO programming, let's see how it's used. 


4 PREVIOUS 
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3.2 Using Perl Classes (Without Writing Them) 


Before you actually start writing classes, it's helpful to know how to use them. This section 
shows you how to use OO Perl classes, even if the syntax is new to you and you've never 
written one yourself. 


Thanks to the large and active community of Perl programmers, there are many useful Perl 
classes already written and freely available to use in your programs. Very often, the class you 
want already exists. All you need to do is obtain it and use it. 


First, you need to find the appropriate module or modules (CPAN is the most common source 
for modules), install it, and examine the documentation to learn how to use the class. Finding 
and installing OO modules employs the same process covered in Chapter 1. 


What's different about OO modules is how they create data structures and call and pass 
arguments to subroutines. In short, there's some new syntax to learn that amounts to a 
slightly different style of programming. 


Object-oriented code creates an object by naming the class and calling a special method in 
the class (usually called new). The newly created object is a reference to a data structure, 
usually a hash. The object is then used to call methods (or OO subroutines). You're used to 
subroutines that get their data passed in as arguments; by contrast, OO code has a data 
structure that calls subroutines to operate on it. My goal in this section is to explain enough of 
this new terminology and syntax so you can read and understand the documentation for a 
class, and use it in your own programs. 


Let's begin with the documentation for the Carp module, a nonOO module that appears later 
in this chapter. This is a simple module to use; it defines four subroutines, and the 
documentation gives brief examples of their use. Because the Carp module comes installed 
with any recent release of Perl, you don't have to install it. To find out how to use it, type: 


perldoc Carp 


(Upper- and lowercase is significant; typing perldoc carp won't work.) Here's the beginning 
of the output: 





NAME 
carp - warn of errors (from perspective of caller) 
oeluck - warn of errors with stack backtrace 
(not exported by default) 
croak - die of errors (from perspective of caller) 
confess - die of errors with stack backtrace 
SYNOPSIS 
use Carp; 


croak "We're outta here!"; 


use Carp qw(cluck); 
cluck "This is how we got here!"; 


It shows what subroutines are available and how to use them in your code. Additional details 
do appear in the documentation. They are important and sometimes necessary to read 
carefully, but you usually don't need to delve any further than the SYNOPSIS section that gives 
examples. To use the croak subroutine, you first load it with the directive use Carp;. You 
then call croak by providing a string containing a message as an argument; the program 
prints the message and dies. 


The basic Perl documentation is found at http://www.perldoc.com and http://www.perl.com, 
and includes standard distribution modules such as Carp. A great many modules such as 
ioperl aren't shipped with the standard Perl distribution. To find the documentation for Bioper| 





Page 86 


on the web, start at the CPAN web site: http://www.CPAN.org. Once you locate Bioperl in 
CPAN, you'll be directed to the Bioperl home page at http://www. bioperl.org. If the modules 
in question are already installed on your computer system, type at the command line: 


perldoc bioperl 








Depending on how up-to-date your version of Bioperl is, you'll get something like: 
BIOPERL (1) User Contributed Perl Documentation BIOPERL (1) 




















NAME 
Bioperl - Coordinated OOP-Perl Modules for Biology 


SYNOPSIS 
Not very appropriate to put a synopsis - many different 
objects to use. Read on... 





DESCRIPTION 
Bioperl contains a number of Perl objects which are useful 
in biology. Examples include Sequence objects, Alignment 








objects and database searching objects. These objects not 
only do what they are advertised to do in the documenta 
tion, but they also interact - Alignment objects are made 
from the Sequence objects and so on. This means that the 
objects provide a coordinated framework to do computational 
biology. 

















If you are new to bioperl, reading biostart.pod will get 
you aquainted with writing scripts and the main players 
for the objects. 


We now also have a cookbook tutorial in bptutorial.pl 
which has embedded documentation. Start there if learning- 
by-example suits you most. 


Bioperl development is focused on the objects themselves, 
and less on the scripts (programs) that put these objects 
together. There are som xample scripts provided in the 
distribution, but it is not the focus of the objects that 
are distributed. Of course, as the objects do most of the 
hardwork for you, all you have to do is combine a number 
of objects together sensibly. 

















The intent of the bioperl development effort is to make 
reusable tools that aid people in creating their own site 
or job specific applications. 








The bioperl.org (http://bioperl.org) website also attempts 
to maintain links and archives of standalone bio-related 
perl tools that are not affiliated or related to the core 
bioperl effort. Check the site for useful code ideas and 
contribute your own if possible. 








The second paragraph of the DESCRIPTION advises you to read biostart.pod if you're new 
to Bioperl. Type: 


perldoc biostart 


and you get the following output (only the first page is reproduced here): 
BIOSTART (1) User Contributed Perl Documentation BIOSTART (1) 





NAME 





Bioperl - Getting Started 


SYNOPSIS 
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#!/usr/local/bin/perl 





use Bio::Seq; 
use Bio::SeqI0O; 








Sseqin = Bio::SeqIO->new( '-format' => 'EMBL' , -file => 
"myfile.dat'); 
Sseqout= Bio::SeqiIO->new( '-format' => 'Fasta', -file => 


VWPOULDUTL..fa’)i# 


while((my $seqobj = $seqin->next_seq( ))) { 
print "Seen sequence ",$seqobj->display id,", start of seq ", 
substr ($seqobj->seq,1,10),"\n"; 
if ( Sseqobj->moltype eq 'dna') { 





Srev = Sseqobj->revcom; 
Sid = $seqobj->display id( ); 
Sid = "Sid.rev"; 


S$rev->display id(S$id); 
S$seqout->write seq(S$rev) ; 


} 


foreach $feat ( Sseqobj->top_SeqFeatures( ) ) { 
if( $feat->primary tag eq 'exon' ) { 
print STDOUT “Location ",Sfeat-Sstart,"<:", 
Steat=>end," GRE[", sfeat-oglt .string,"]\n"; 


DESCRIPTION 
Bioperl is a set of Perl modules that represent useful 
biological objects. Some of the key objects represent: 
Sequences, features on sequences, databases of sequences, 
flat file representations of sequences and similarity 
search results. 











Because bioperl is formed from Perl modules, there are no 
actual useable programs in the distribution (this is not 
actually true. In the scripts directory there are a few 
useful programs. But not a great deal...). You have to 
write the programs which use bioperl. 








It is very easy to write programs using the bioperl mod- 
ules, as a lot of the complex processing happens in the 
modules and not in the part of the program which you have 
to write. The idea is that you can connect up a number of 
the modules to do useful things. The synopsis above gives 
a simple script which uses bioperl. Stepping through this 
script, the lines mean the following things: 





The file gives an example of code that uses Bioperl modules, and the rest of the biostart 
documentation explains the example in some detail. Let's take a closer look at the OO syntax 
of this example. 


After the typical use statements (needed to load the modules), such as: 
use Bio::SeqI0O; 


the documentation has the following line in the example: 





Sseqin = Bio::SeqIO->new( '-format' => 'EMBL' , -file => 'myfile.dat'); 


This line calls the new method. new is the name typically used in OO Perl for the subroutine 
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that creates an object. The object that's returned from the method call is saved as a 
reference in a scalar variable. In this case, the new object in the Bio: :SeqIo class is saved in 
the reference variable $seqin. 





The new method is identified by giving the name of the class (Bio: :SeqI0), followed by an 
arrow (->), and finally the method name: 





Bio: :Seql0O->new 


This is the syntax for calling methods. If you're just interested in using the class, not in 
understanding the inner mechanisms, you simply have to remember to invoke the new 
method that creates a class object in this way. The other methods in a class are typically 
called from a class object that has been previously created (by just such a call to a new 
method). 


Later in the biostart example you see the line: 

while ((my $seqobj = $seqin->next_seq( ))) { 

The call to the method next_seq is done as follows: 

Sseqobj = S$seqin->next_seq( ) 

Here, the Bio: :SeqIO class object $seqin is being used to call the method next_seq in the 
class Bio: :SeqI0O. Because $seqin was created as an object in the class Bio: :SeqIo, it can 
be used with arrow notation (->) to call a method in the class, without specifically mentioning 


the class name. This is how methods other than new are typically called. The result here is 
saved as $seqobj, a new object. 








The important thing to remember about subroutines in OO code is that you create objects by 
calling the new method in the class. You call other methods by calling them on an object in the 
class. Both types of calls are accomplished with arrow notation. 

Here's a new object being created in a class Myclass: 

Smyobject = Myclass->new( ); 

Here's a method compute being called on that object: 

Smyobject->compute( ); 

One more item in the biostart example that may appear unfamiliar is the way arguments to 
the methods are specified. Consider this line from the example: 

Sseqin = Bio::SeqIO->new( '-format' => 'EMBL' , -file => 'myfile.dat'); 





The arguments are passed as named arguments, which are pairs with the name of the 
argument followed by the symbol => followed by a value for the argument. If this looks 
suspiciously like the notation used to initialize hashes, it's no accident. In this invocation of 
the new method, you're initializing a new object, which is implemented as a hash, so it makes 
sense that you'd pass your initial values to the object using hash notation. 


The details of how arguments are passed to methods are covered later in the chapter. For 
now, if you just want to use this class, you'll need to pass your arguments to the new method 
in the style just shown. Methods in a class usually pass arguments in this hash-like, key => 
value notation, but not always. If you use the syntax as shown in the documentation, your 
code will be fine. 


One advantage to using a hash for arguments is that the arguments can be given in any 
order: you don't have to give them in a prescribed order as is often the case when passing a 
list of scalars as arguments to a subroutine. 


I'll return to this biostart example in Chapter 9. For now, know that you should have 
sufficient syntax information to use a Perl OO module. The next section shows how to start 
pulling it all together. 


4 PREVIOUS 
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3.3 Objects, Methods, and Classes in Perl 


To the user of a class, the most important piece of information is the interface describing how 
to use the class. Usually this interface can easily be summarized in a few examples provided 
by the author of the class. The details of how the class is implemented may change; as long 
as the interface remains the same, your code needn't change even when the internals of the 
class you're using change. This provides good modularization, protecting your system from 
the ripple effects of changes in individual components, thus making your code as a whole 
more robust and reliable. This is one of the main benefits of OO design. 


With OO design, you know that: 
e Aclass is a package. 


e An object is a reference to a data structure in the class, that is marked (or "blessed") 
with the class name. 


e A method is a subroutine in the class. 


Several other concepts and their associated terminology are also important in object-oriented 
programming. For instance, inheritance enables one class to use the definitions from another 
class, while adding to or changing the definitions. 


At this point you may be wondering: if an object is really just a data structure (usually a 
reference to a hash), and if a method is really just a subroutine, then why all this new 
terminology? The answer is that the framework imposed upon these data structures and 
subroutines, in which each data structure has a defined set of subroutines that alone can 
access the data structure, is indeed a new level of abstractiona new set of constraints 
influencing the programming structure. 


These constraints have proved to be so frequently useful that the new terminology of class, 
object, and method does say something new about the data structures and subroutines 
involved. Also, there are some new features such as bless and the arrow notation that cause 
subroutines to behave a little differently, as has been mentioned already, and will be explained 
in detail. But surprisingly few such additions are needed to transition from a knowledge of 
Perl's declarative style to Perl's OO style. 


When you program using a defined class with its methods and objects, you can gain access 
to the class data only with the class methods provided by the class designer. The restriction 
of access to a class's data to the methods alone is called encapsulation. From the standpoint 
of the programmer using the class, exactly how the methods and objects are implemented 
isn't necessarily a concern. As a programmer using the class, you can regard the class as a 
black box; you don't have to look inside to see how it is implemented. Usually, it's only the 
programer writing the code who needs to worry about that. 


One final point: in the field of OO programming, different authors define terms in different 
ways and present differing sets of essential concepts. Be warned that considerable diversity 
exists in the literature and among the languages that deal with object-orientation. The 
excellent book Object Oriented Perl (see Section 3.14) includes a table that matches basic 
concepts with some of the alternate terminologies for those concepts. 


3.3.1 Perl Objects Are Usually Hashes 


In OO Perl programming, most objects are implemented as hashes and are accessed with 
blessed references to the hashes. 


Each individual component of data in an objectfor example, the name of a geneis called an 
attribute and is assigned a value. The attribute/value nature of the component of an object is 
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well suited to be represented as a key/value pair in a hash data structure. 


It is quite possible, and sometimes desirable, to use other data structures such as scalars or 
arrays as the basis for objects. However, most situations are well handled by the hash data 
structure. Its benefits include having a name (key) for the data (value), allowing arguments to 
be specified in any order, and very fast lookup speed. 


In this book, all objects are based on the hash data structure. 


Team LiB 
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3.4 Arrow Notation (->) 


Object-oriented Perl code uses arrow notation (->) to call methods. Understanding how this 
works is essential to understanding OO Perl. Before you start reading OO Perl code, let's look 
more closely at its main features and how arrow notation is used to call methods. 


7] The arrow (->) also appears in Perl when dealing with complex data structures, as you saw in Chapter 2, 


where it's used for references to subroutines. 


The arrow notation is used on an object to call a method in a class. Because the object has 
been blessed (i.e., marked with the class name), Perl can tell from the object what class it's 
in and so knows to find the method in that same class. With arrow notation, Perl also passes 
to the method a new argument that automatically appears first in its argument list. The other 
arguments are shifted over; the first argument is now the second, and so on. The automatic 
passing of a new first argument to the method is the key to understanding OO Perl! code. 


The method name appears to the right of the arrow. Perl then uses what's immediately to the 
left of the arrow to identify the class in which to find the method. It also passes information 
about what's on the left of the arrow to the method, where it appears as the first argument 
of the method. The left side of the arrow may be in one of two forms: 


e The name of the class. Here's an example: 
TRNA->new(  ); 


Here Perl sees that the left side is the name of the TRNA class; therefore it calls the new 
subroutine in the TRNA package. It also automatically passes the name TRNA to that 
subroutine as its first argument, shifting any other arguments that may have been 
explicitly given (you'll see how this feature is used in later examples). You need to save 
the new object (a blessed reference to a hash) using the assignment operator = as 
follows: 


$trna_object = TRNA->new( ); 
e An object. Here's an example: 
$trna_object->findloops( ); 


Perl sees that on the left side of the arrow $trna_object is an object of the TRNA 
class; it therefore calls the findloops method in the TRNA class. (It can see that 
$trna_object is a TRNA object because the object was blessed into the TRNA class 
when it was created by the new method, as I'll explain later.) Perl also passes a 
reference to the $trna_object object into the findloops method as the first 
argument to the method, shifting any other arguments that may have been explicitly 
given. 


Why does Perl do it this way? The short answer to that question is that once you understand 
how it works, your code will become simpler and more usable. You will need to type class 
names less frequently, and you can use methods written for one class in another class 
(inheritance). 


The two new tricks that Perl performs here are: 
e Using arrow notation to find the correct method in the correct class 
e Passing the method a new first argument 


It can find the correct class because objects are blessed with their class name. 


Team LiB 


Page 93 


[Team LiB J 


3.5 Gene1: An Example of a Perl Class 


This first example of Perl code defines a very small Perl class. This code does several new 
things, which I will explain in detail after the code. 


This first version of the class is called Genel, and it demonstrates the essential features 
needed to implement a simple class. Genel looks similar to the last module definition in 
Chapter 1, but with a few new wrinkles that transform it into OO software. I progress from 
Genel.pm to Gene2.pm, then to Gene3.pm, and then to the final version, Gene. pm 


The methods of the Genel class will permit creating Gene1 objects and finding out what the 
values of a Genel object's attributes are. 


Here's the module that implements a Genel class. I put the module into a file called Gene1.pm 
and place it into a directory on my computer that can be found when Perl needs it. I continue 
putting my code into my own development library directory, which on my Linux system is the 
directory /home/tisdall/MasteringPerlBio/development/lib. This directory is pointed out to Perl 
at the beginning of the testGenel program that appears later as an example of how to use 
the Genel.pm module definition. You will probably use a different directory on your computer, 
in which case you'll have to change this line: 


use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 
You can also put the directory on the command line or set the PERLSLIB environmental 


variable, as described in Chapter 1. Setting the PERL5LIB variable is the easiest because you 
don't have to change the use 1ib lines in the programs. 


A Genel object consists of a gene name, an organism represented by genus and species, a 
chromosome, and a reference to a protein structure in the PDB: 


package Genel; 
use strict; 
use warnings; 


use Carp; 


sub new { 


my ($class, %arg) = @ ; 

return bless { 
_name => Sarg{name} || croak("no name"), 
_organism => Sarg{organism} || croak("no organism"), 
_chromosome => Sarg{chromosome} [| e222"; 
_pdbref => Sarg{pdbref} [| Serre", 


}, Sclass; 


sub name { $ [0] -> {_name} } 
sub organism { $_[0] -> {_organism} } 
sub chromosome { $ [0] -> {_ chromosome} } 
sub pdbref { °S[0] =>-{pdbret} } 
Ap 


That's the whole thing! 





Woe 


Hash keys that are simple words such as "name, 
« g. don't need to have quotes around them when they appear within their 
surrounding curly braces: Sarg{name} means the same thing as 
Sarg{'name'}. 


organism," and so on, 
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Here's a small program that uses the Genel. pm class: 


use strict; 
use warnings; 


use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 


use Genel; 
print "Object 1:\n\n"; 


my Sobj1l = Genel->new ( 


name => "Aging", 
organism => "Homo sapiens", 
chromosome => "23", 

pdbref => "pdb9999.ent" 


); 


print Sobjl->name, "\n"; 

print S$objl->organism, "\n"; 
print Sobj1->chromosome, "\n"; 
print $obj1->pdbref, "\n"; 


print "Object 2:\n\n"; 

my Sobj2 = Genel->new ( 
organism => "Homo sapiens", 
name => "Aging", 


)3 


print S$obj2->name, "\n"; 


print $obj2->organism, "\n"; 
print S$obj2->chromosome, "\n"; 
print $o0bj2->pdbref, "\n"; 
print “Object B2\n\n"; 


my Sobj3 = Genel->new ( 


organism => "Homo sapiens", 
chromosome => 23", 
pdbref => "pdb9999.ent" 


); 


print S$obj3->name, "\n"; 

print Sob7j3=>Sorganism, "\n"; 
print $obj3->chromosome, "\n"; 
print $obj3->pdbref, "\n"; 


If I put that demonstration program code into file testGenel1, I get this output from running 


that demonstration program with perl testGenel: 
Object 1: 


Aging 

Homo sapiens 
23 
pdb9999.ent 


Object 2: 


Aging 

Homo sapiens 
2??? 

2??? 
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Object 3: 


no name at testGene line 35 


Now, let's take a look at how this Perl code works. 


L 
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3.6 Details of the Gene1 Class 


In this section, I introduce the OO features used to make a class in Perl. First however, I 
explain the variable naming convention I use, as well as the handy Carp module. 


3.6.1 Variable Names and Conventions 


Using an underscore in front of a name is a programming convention that usually indicates 
that the item in question (e.g., a variable or hash key) isn't meant for the outside world but 
only for internal use. 


This is just a convention; Perl doesn't require you to do it. It will, however, make your code 
easier to read and understand. 


I generally follow this convention and put underscores in front of names that I don't want 
directly accessed by the programmer using the class. (In Perl, unlike some more strict OO 
languages, you can access data that's internal to a class, which make this naming convention 
that distinguishes internal variables particularly useful.) 


Thus, in my Genel class, the attributes name, organism, chromosome, and _ pdbref are 
used internally only as the hash keys for the attributes in the object. When you use the class, 
as I do in my example program testGenel, you don't even have to know these names exist. 


The interface is through arguments that specify the initialization values of these attributes. 
These arguments are called name, organism, chromosome, and pdbref. I also have methods 
the subroutines also called name, organism, chromosome, and pdbrefthat return the value of 
the actual attributes stored in the object. 


3.6.2 Carp and croak 
The Carp module is called near the top of Genel .pm with use Carp;. 


Carp is a standard Perl module that provides informative error messages in the case of 
problems. carp prints a warning message; croak prints an error message and dies. They are 
very much like the Perl functions warn and die; they report the line number in which the 
problem occured in the error message and report from what subroutine they were called. I 
use croak in my code; it prints out the error message provided, names the file and the line 
number and subroutine where it's called from, and then kills the program. 


This function is certainly useful during development because it's another way to find errors in a 
program as it's being created. It also gives program users the ability to report the exact 
location of a problem, should one occur, to the programming staff (which may be just one 
programmer, you!). 


In my program output, the Carp message is: 
no name at testGenel line 35 





It's produced by the line: 


_name => Sarg{name} || croak("no name"), 


in the Genel.pm module file. Line 35 of testGenel is the beginning line of this part of the 
program: 


my Sobj3 = Genel->new ( 


organism => "Homo sapiens", 
chromosome ao, DS, 
pdbref => "pdb9999.ent" 
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It's the part of the code that tries to do something bad: it's trying to initialize a new object 
without setting its name. You'll see how this works in more detail in the following sections. 


3.6.3 The new Constructor Method 


To create objects, I defined a special constructor method called new. A call to new returns a 
new object, properly initialized. The new object is also marked as a member of the class, in 
this case the class Genel. 


sub new { 


my ($class, %arg) = @ ; 

return bless { 
_name => Sarg{name} || croak("no name"), 
_organism => Sarg{organism} || croak("no organism"), 
_chromosome => Sarg{chromosome} [| “e22o", 
_pdbref => Sarg{pdbref} [lee 


}, Sclass; 


} 


Note that each class may have its own requirements for creating a class object, and so each 
class's constructor method may be different than that for another class.” For instance, a 
constructor may or may not provide default values for its attributes. Still, there's a lot of 
similarity between the constructor method of most classes. 


(1 Ih particular, a constructor method may have any name in Perl; you could call it constructor, 
OverTheSun, or anything that you choose. Most programmers just use the very familiar name new. 


Let's dissect the code of the constructor new. You'll see how objects are marked as members 
of a class, and initialized, by their constructor methods. Here are the main novelties: 


e The package name Genel is automatically passed to the subroutine new as its first 
argument, even though it isn't included in the argument list. 


e The returned hash reference is marked with the name Genel (using the bless 
function) thus making it an object in the Genel class. 


Everything else here is straightforward Perl subroutine code. 


Note that the call to new in the demonstration program testGenel is made as follows: 

my Sobj = Genel->new( ... ); 

The scalar variable $obj is a reference that points to the anonymous hash that's returned 
from the new method. The object is a hash that contains the attributes of the object, namely 


the key/value pairs of the hash. As usual the reference variable $obj is lexically scoped with 
my. And, as you see, $obj is marked with the class name Genel. 


The call Gene1->new includes the name of the package Genel in which the new subroutine is 
defined. The package name is the class name; the name of the module file in which the class 
is defined must be the package name with .pm added. So you have a class Genel in a module 
file Genel .pm that has the declaration package Genel;. 


The call to new with its arguments is of the form: 
Genel->new( keyl => 'valuel', key2 => 'value2', ... ) 





This call does two important things: 
1. It calls the new subroutine in the Genel package. 


2. It passes the name Genel of the package to the new subroutine as its first argument. 
Therefore, in the new subroutine, in the line that collects the arguments: 


my (S$class, Sarg) = @ ; 
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the first argument is automatically the string Gene1 and is assigned to the variable 
Sclass. 


This first argument Genel isn't listed in the usual place in the parentheses after the subroutine 
name in the call to the subroutine: 


new( keyl => 'valuel', key2 => 'value2', ... ) 


It happens automatically when the package name is used with an arrow (->): 





Genel->new( keyl => 'valuel', key2 => 'value2', ... ) 


This may seem a bit odd, but it has the desirable advantage of making it unnecessary to type 
the class name Genel twice: once to call the new method in the Genel package, and again to 
pass the class name Genel to the new method. Instead of typing: 


Genel->new( "Genel", keyl => 'valuel', key2 => 'value2', ... ) 





you can just type: 
Genel->new( keyl => 'valuel', key2 => 'value2', ... ) 





It's simply a bit of handy syntax the designers of Perl added to save a bit of typing when 
writing OO code in Perl, nothing more or less. 


Now, let's examine the innards of the new constructor method. The new constructor method 
has the form: 
sub new { 

my ($class, %arg) = @ ; 

return bless { 


}, Sclass; 


} 


First, notice that in addition to assigning the first argument, the class name Genel, to the 
variable $class, the subroutine captures the rest of the arguments in the hash variable %arg. 


Recall, from your previous study of Perl, that initializing a hash by assigning a list to it causes 
the items in the list to be treated as key/value pairs in the hash. For example, if the 
arguments are: 

('Myclass', mykeyl => 'myvaluel', mykey2 => 'myvalue2') 


the scalar variable $class gets the value Myclass, and the hash variable sarg gets two 
key/value pairs initialized to the key 'mykey1' with the value 'myvaluel', and the key 'mykey2' 
with the value 'myvalue2', Also recall that => is a synonym for a comma.” 


'] Tt also forces its left side to be interpreted as a string and removes the need to surround the string in 


quotes, which is exactly what I want here. 


3.6.4 Creating an Object with bless 


The new constructor then returns the value of: 
bless { ... }, Sclass; 


The built-in Perl function bless does a very simple thing, but it's enough to take a data 
structure and make it an object in a class. It marks a reference with a class (package) name. 


In this code, bless takes two arguments. The first, delimited by a pair of curly braces, is an 
anonymous hash, which you'll recall is a reference to an unnamed hash. This anonymous 
hash contains the data of the resulting object. The second argument to bless is just the 
name of the class, as it was saved in the $class scalar variable. 


This call to bless returns a hash that is "marked" with the name of the class. The hash that 
bless marks is then given to the return function to serve as the returned value of the new 
method. 
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The object reference that is returned can now be identified as an object in the class Genel. 
The object reference in this example is marked with the name Genel and has a hash as its 
top-level data structure. The new method in the class creates a new object in the class. 


Although the first argument to bless in this code is an anonymous hash; in general, it can be 
any reference to a data structure that serves as an object. It can be a reference to a scalar, 
an array, a hash, or a more complex data structure. In the example, I am just declaring an 
anonymous hash in place rather than providing a reference to an existing hash. So, for 
example, if I declare a hash and a reference to it like so: 

shash = ( keyl => 'valuel', key2 => 'value2' ); 

Shashref = \%Shash; 


then I can bless the hash, mark it with the class name HashClass, and save the resulting 
object: 

Shashobj = bless Shashref, 'HashClass'; 

Alternatively, the same object $hashobj can be created using an anonymous hash, and one 
call to bless: 

Shashobj = bless { keyl => 'valuel', key2 => 'value2' }, 'HashClass'; 


3.6.5 Using ref to Report an Object's Class 


The Perl function ref reports on the type of element referred tovariable, object, code, etc. If 
the variable is blessed, ref reports on the class it is marked with. 


After the call to new to create the Genel object $obj, the line: 
prant tet Sobj, "\n"? 


prints out as Genel. 


The Perl function ref returns false if its argument isn't a reference. If it is a reference, it 
returns one of the following: 


SCALAR 
ARRAY 

HASH 
CODE 
REF 
GLOB 
LVALUE 

















If the reference has been blessed into a package, that package name is returned from the 
call to ref. 


3.6.6 Initialize an Object with an Anonymous Hash 


Here again is the complete definition of the new method in the Genel class: 


sub new { 


my ($class, %arg) = @ ; 

return bless { 
_name => Sarg{name} || croak("no name"), 
_organism => Sarg{organism} [| eroak("no organism”), 
_ chromosome => Sarg{chromosome } Pi M2222, 
_pdbref => Sarg{pdbref} [i “22r2", 


}, Sclass; 


} 
The first argument to bless is the following anonymous hash: 


{ 


_name => Sarg{name} || croak("no name"), 
_ organism => Sarg{organism} | |, eroak ("no organism") » 
_ chromosome => Sarg{chromosome } [I Bez o 
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_pdbref => Sarg{pdbref} It ee 
} 


As should be familiar (if not, see Appendix A for a Perl refresher), the key/value pairs are 
separated by the "syntactic sugar" symbol =>. The keys are in the first column; the variable 
names name, organism, chromosome, and _pdbref are used as the names of the keys. 


The desired values are in the second column, following the => symbol. They are given in the 
form of a Perl logical OR operator. The value has either been passed in, or the default value is 
used: 


value || default 


The values are the values assigned from the argument list to the hash arg upon entry to the 
subroutine. If all these arguments are passed to the new method, the hash initializes its four 
keys (_name, organism, chromosome, and _pdbref with those values). 


If chromosome or pdbref, is passed to the new method, those values of arg aren't defined, 
and the subroutine assigns the default value (the string ????) to the missing keys ( 
_chromosome, pdbref, or both). 


If name or organism aren't passed as arguments to the new method, their values in arg 
aren't defined, and by default, the subroutine calls croak and the program exits with an error 
message. 


Let's look closely at a line in the Gene1.pm module that calls croak: 


_name => Sarg{name} || croak("no name"), 


This line is part of a hash initialization. It is initializing an entry with a key _name. The value to 
be associated with this key is given as: 


Sarg{name} || croak("no name") 


This sets the value of the key to the value $arg{name} if that value exists. If Sarg{name} 
doesn't exist, the value croak("no name") is evaluated. The behavior of | | (the or Boolean 
operator) is that the first argument is evaluated. The second argument is evaluated only if the 
first argument evaluates to false. In this code, the second argument kills the program and 
prints an error message when it is evaluated. This is a bit of a trick, but it's a common one 
that's used in several programming languages that have the Boolean or operator. 


Now that you've seen how the new constructor handles its arguments, let's look again at how 
the test program testGenel calls the new method, which it does three times: 


my Sobj1l = Genel->new ( 


name => "Aging", 
organism => "Homo sapiens", 
chromosome => "23", 

pdbref => "pdb9999.ent" 


)3 


my Sobj2 = Genel->new ( 
organism => "Homo sapiens", 
name => "Aging", 

\; 


my Sobj3 = Genel->new ( 


organism => "Homo sapiens", 
chromosome => "23", 
pdbref => "pdb9999.ent" 


)3 


The key/value pairs (the keys are the attributes of the objects) are passed to the new 
method. Notice that, due to the use of the arg hash to capture these arguments by new, the 
order in which the arguments are passed isn't important. This is a nice convenience when 
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creating and initializing objects because there are often many attributes and some may or 
may not be initialized; being able to ignore the order of the arguments when you call new 
makes it easier to program. Recall that it's a general property of Perl hashes that the order of 
the keys isn't important; it has to do with how hashes are implemented, and why they're so 
fast at retrieving values. 


You'll recall that the use of croak in the new method requires the initialization of the name and 
organism attributes. For instance, $0b 3 isn't created with an initial value for the name 
attribute. The new subroutine was defined to require such an initial value, which makes sense 
because, at the least, I want every gene in my program to have a name and an originating 
organism. The output of the testGenel program shows that this third call to new triggers the 
croak exit mechanism. 


3.6.7 Accessor Methods 


Accessor methods are subroutines in the class that return the values of the class attributes. 
These attributes are usually implemented as keys of the hash that serves as the class object. 
You can access the attributes of an object, and their values, directly; for example, given an 
object of the Genel class, you can print out its name like so: 

print Sobj->{ name}; 


This gives the value of the key _name in the anonymous hash pointed to by $obj. This works; 
however, it's not good OO style. It directly accesses the data in the object; good style 
requires you to access the data through subroutines defined for that purpose. It is preferable 
to restrict all access of an object's attributes to the use of specific methods. 


The actual attribute is called name. This is initialized from the value of the argument name in 
the initialization of the arguments, as in this line from new: 
_name => Sarg{name} || croak("no name"), 


That was just a convenient way to pass arguments to new, SO you Can Say: 








new( name => 'Ecoli' ) instead of new( name => 'Ecoli' ) 


But you can just define a subroutine called, conveniently, name that returns the value of the 
attribute $obj->{_ name}. 


In my program, I have defined a method for each key in the hash. I have method name, which 
accesses the value of the key _name; I also have a similar method for each other key. Here's 
how to define a method to access the value of the key _name: 


sub name { $ [0] -> {_name} } 
This is called by the following line in the testGenel program: 
print Sobjl->name, "\n"; 


It calls the method name for the object, which then accesses the value of the key _name in the 
object. In this way the actual implementation of the data that is stored in the object is kept 
hidden from users of the class methods. If the data is retrieved with a method, and if the 
author of the class decides at a later date to change the way the object stores its data, the 
users of the class can still get at the data by making the same method call. Only the internals 
of the method call will change; the behavior of the method, namely what arguments you give 
it and what return values you expect from it, stay the same. When the interface remains the 
same, the code that uses the class can also remain the same, saving everybody time and 
trouble, even when new versions of the class are developed. 


The method name receives the object as its first argument because it is called by: 
Sobj1->name 


The body of the subroutine uses the Perl built-in @ array to access its arguments. The first 
argument to the subroutine is referred to as $_[0]. That first argument is the object, a 
reference to a hash, so I give it the key _name to retrieve the desired value: 
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$ [0] -> {_name} 


Finally, since by default a subroutine returns the value of the last statement executed, this 
subroutine returns the gene name it has retrieved from the object. 


L 
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3.7 Gene2.pm: A Second Example of a Perl Class 


Genel demonstrated the fundamentals of a Perl class. Now, I'll build a more realistic example, 
which also includes a few additional standard Perl techniques. 


My goal is to present an example that you can imitate in order to begin to develop your own 
OO software. I'm going to build the example in three more stages, expanding upon the 
Genel.pm module. First, I'll add mutators, which are methods that alter the data in an object. 
I'll also add a method that gives information about the class as a whole, returning the count 
of how many objects in the class exist in the running program. This depends on the use of 
closures, methods that use variables declared outside the methods. This is the new material 
in the Gene2.pm module. 


After that step, I introduce the AUTOLOAD mechanism, which gives a single class method 
called AUTOLOAD that can define large numbers of other methods and significantly reduce the 
amount of coding you need to write to develop a more complex object (among other 
benefits to be described later). That will be the Gene3.pm module. 


We'll end up with a Gene.pm module you can use as a basis for your own Perl module 
development. It will add a mechanism to specify what properties each attribute has (which 
can prevent improper data manipulation, for instance). It will show how to initialize an object 
with class defaults and how to clone an existing object. Finally, Gene.pm will show you how to 
incorporate the documentation for a class right in the Perl code for the class. 


Here is the code for the intermediate Gene2.pm module. Following the Gene2.pm module is an 
example of the code and output of a small test program that drives the module. Take a 
minute to look at these two code examples, especially at the comments. The module 
Gene2.pm contains several new details that will be discussed following the code. The test 
program should be fairly easy to read and understand. 


package Gene2; 

# 

# A second version of the Gene.pm module 
# 


use strict; 
use warnings; 
use Carp; 


# Class data and methods, that refer to the collection of all objects 
# in the class, not just one specific object 

{ 
my $ count = 0 
ub get _count 
$count; 


a 


~ Ne 


n 


ub aner count, { 
++$ count; 





n 


ub _decr_ count { 
—=$_ County 








} 


# The constructor for the class 
sub new { 

my ($class, %arg) = @ ; 

my Sself = bless { 
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_ name => Sarg{name} croak("Error: no name"), 





| | 
_organism => Sarg{organism} || croak("Error: no organism"), 
_chromosome => Sarg{chromosome}|| "????", 
_pdbref => Sarg{pdbref} le Pee eens 
}, Sclass; 
Sclass->_iner count( ); 


return Sself; 


} 


# Accessors, for reading the values of data in an object 








sub get_name { $_[0] -> {_name} } 
sub get_organism { $_[0] -> {_organism} } 
sub get_chromosome { $ [0] -> {_ chromosome} } 
sub get _pdbref { $_[0] -> {_pdbref} } 
# Mutators, for writing the values of object data 
sub set name { 

my ($self, Sname) = @ ; 

$self -> { name} = $name if $name; 


sub set_organism { 
my ($self, Sorganism) = @ ; 
$self -> {_organism} = Sorganism if Sorganism; 


sub set chromosome { 
my ($self, Schromosome) = @ ; 
$self -> { chromosome} = $chromosome if $chromosome; 


sub set _pdbref { 

my ($self, Spdbref) = @ ; 

Sself -> { pdbref} = S$pdbref if Spdbref; 
} 


1; 

Here is the small test program testGene2 that demonstrates how to use the objects and 
methods in this version Gene2 of our OO class: 

#! /usr/bin/perl 


n 
# Test the second version of the Gene module 


use strict; 
use warnings; 





Change this line to show the folder where you store Gene2.pm 
use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 
use Gene2; 


Create object, print values 








print "Object 1:\n\n"; 


my Sobj1l = Gene2->new ( 


name => "Aging", 
organism => "Homo sapiens", 
chromosome => 23") 

pdbref => "pdb9999.ent" 


); 


print Sobjl->get_name, "\n"; 
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print Sobjl->get_organism, "\n"; 
print Sobjl->get_chromosome, "\n"; 
print Sobjl->get_pdbref, "\n"; 


# 


# Create another object, print values ... some will be unset 


# 
print "\n\nObject 2s\n\n"; 


my Sobj2 = Gene2->new ( 
organism => "Homo sapiens", 
name => "Aging", 


); 


print Sobj2->get_name, "\n"; 

print Sobj2->get_organism, "\n"; 
print Sobj2->get_chromosome, "\n"; 
print Sobj2->get_pdbref, "\n"; 


# 

# Reset some of the values, print them 
# 

Sobj2->set_name ("RapidAging") ; 
Sobj2->set_chromosome ("22q") ; 
Sobj2->set_pdbref ("pd£9876.ref") ; 











print: "\n\n''s 

print Sobj2->get_name, "\n"; 

print Sobj2->get_organism, "\n"; 

print S$obj2->get_chromosome, "\n"; 

print Sobj2->get_pdbref, "\n"; 

print "\nCount is ", Gene2=>get. count; "\n\n"; 

# 

# Create another object, print values: but this fails 
# because the "name" value is required (s the "new" 
# constructor in Gene2.pm) 

# 


print "\n\nObject 3:\n\n"; 


my Sobj3 = Gene2—->new ( 


organism => "Homo sapiens", 
chromosome => "23", 
pdbref => "pdb9999.ent" 


My 


print. "\nCount is", Gene2=>ger. count, "\n\n"; 
Finally, here's the output from the test program testGene2: 
Object 1: 

Aging 

Homo sapiens 


Zo 
pdb9999.ent 


Object 2: 


Aging 
Homo sapiens 
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eee? 
eee? 


RapidAging 
Homo sapiens 
22g 
pdf9876.ref 


Count is 2 


Object 3: 





Error: no name at testGene2 line 68 


It's a good idea to take a moment to read through this Gene2.pm module, the test program 
testGene2, and the output. Compare this new Gene2 module with the earlier Genel module. 
In particular, notice where the methods are defined in the module, and then how they are 
actually used in the test program. Don't get hung up on the details in this first reading; just 
look at the overall picture. Notice that the definitions are all in the module Gene2.pm, which is 
then loaded at the beginning of the test program testGene2; itis testGene2 that actually 
creates the module's objects and uses the module's methods on those objects. In other 
words, testGene2 is a program; Gene2.pm is a definition of a class that is used in testGene2. 





Let's begin examining the module code. 
3.7.1 Closures 


A closure keeps track of class data. Class data refers not to a particular object, but to 
several, possibly all, objects of a class that have been created during the running of your 
program. This is frequently important to do. For instance, say you have a DNA sequencing 
pipeline that can handle only 20 sequences at any one time. You'd want your controlling 
program to block any attempt to create more than 20 sequence objects until the pipeline is 
ready to receive more. To do this, you would keep a count of how many sequence objects 
your controlling program has created. Closures are a way to program such class data. 


A closure is a subroutine that uses a variable defined outside the subroutine. By surrounding 
such a variable and some closures that use that variable within a block, you can use the 
closures to access the variable from anywhere in the program, and the variable will never go 
out of scope and lose its value. This section will explain how this works and how to use it in 
your code. 


The following code is new in Gene2.pm: 


# Class data and methods, that refer to the collection of all objects 
# in the class, not just one specific object 


{ 


my $ count = 0; 
sub get count { 
$ count; 


n 


ub incr count { 
++$ count; 





n 


ub _decr_count { 
==$_ count? 








} 


This code creates a variable $ count. $ count is a lexical my variable in a block of curly 
braces, and therefore is hidden from all parts of the code except within the block. The three 
methods that are also defined in the same block use the variable $_count.This variable 
persists throughout the life of the program because the subroutines defined with it are 
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closures. For example, in the code for the class module Gene2.pm, I use $_count to keep a 
count of how many objects are in existence at any given time. Notice that the method 
names incr count and _decr count begin with a leading underscore, as does the variable 
name $_ count. They aren't meant to be called by the user of the class but are internal to the 
module. On the other hand, the remaining method get_count doesn't begin with a leading 
underscore and is meant to be called whenever the user of the class wants to know what the 
count is. 


The previous section of code implements a closure. It is surrounded by curly braces creating a 
Perl block. You've seen many blocks associated with loops and conditionals as you learned 
the fundamentals of Perl. The block here stands on its own without being a part of another 
programming construct. 


Any block, this one included, creates a new scope for the variables that occur within it. my 
variables (also called lexical variables) within a block exist only while the program is 
executing the statements within that block. When a program leaves a block by passing 
beyond its closing curly brace, the my variables within it go out of scope. In other words, they 
cease to exist, and disappear from the program until the program reenters the block, and 
they are created anew. 


The preceding paragraph is correct; however, there is one important "but." 


Subroutine definitions don't go out of scope in the way that lexically scoped (my) variables do. 
It is also possible for a subroutine definition to affect the behavior of a lexically scoped 
variable. Aha. Read on. 


To repeat: subroutine definitions aren't subject to the same constraints as variables in 
regards to my and blocks. In fact, a subroutine definition is global to the entire package in 
which it's declared. Perl looks for subroutine definitions at compile-time, before actually 
running the program, and makes a subroutine definition available to an entire package no 
matter where the subroutine is declaredeven if it's declared in a conditional block that's never 
reached during runtimewhen the program code is actually executed. 


As an example, here is a small program with a subroutine definition: 
= 


# A program to demonstrate the global nature of subroutine definitions 


# 
my Sdna = 'ACGT'; 


if ($dna eq 'ACGT') { 
print "This statement gets executed\n"; 
print "Here's the subroutine call:\n"; 
isdna(S$dna) ; 








} else { 





print "This statement does not get executed\n"; 
# 
# The following subroutine definition is in a block which is 
# never executed at runtime. 
# 
sub isdna { 

# Print the argument if it is DNA 

if($_[0] =~ /*[ACGT]+$/i) { 

PEIne. 90) “\ay 





else { 
return 0; 


} 


} 
This produces the following output: 
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This statement gets executed 
Here's the subroutine call: 
ACGT 


As you see, even though the subroutine definition is buried in a block that's never entered, not 
even once, it is still available to the program. Perl scans the program at compile-time, reads in 
any subroutine definition no matter where it is, and the subroutine definition is then available 
to be called from anywhere in the program at runtime. 


Continuing on, in the code from Gene2.pm under consideration, there's the variable definition: 
my $ count = 0; 

which occurs outside the following subroutine definitions such as: 

sub _incr count { 


++$ count; 
} 


The variable $_count is declared outside the subroutine _incr count, but the subroutine 
uses the variable. Therefore, by definition, the subroutine _incr count is a closure. 


There's just one more piece to the puzzle. Consider again the code fragment from Gene2.pm, 
which I repeat here: 


# Class data and methods, that refer to the collection of all objects 
# in the class, not just one specific object 

{ 
my $ count = 0 
ub get _count 
$ count; 


~ Ne 


n 


n 


ub aner count -{ 
++$ count; 





n 


ub _decr_ count { 
==$_ county 








} 


It seems that when the program leaves the block that encloses this code, the variable 
$ count should go out of scope and no longer be available to the program. However, in 
Gene2.pm the $ count variable doesn't cease to exist. 


Because the subroutine definitions in this block are global, and because they also reference 
the variable $_ count, Perl knows that at any point in the program you can put in a call to, 
say, get_count, which in turn needs the variable $_count to execute. Perl doesn't cause the 
variable $ count to cease to exist because it sees the closures and avoids destroying the 
variable they reference at runtime. At any point in the program, the value of $_ count can be 
obtained by calling the subroutine. However, the value of $ count can't be accessed in any 
other way than by get_count or other closure defined within the same block. 


To summarize, by defining a variable and a closure that uses that variable within a block, a 
program can limit access to that variable to calls by the closures. This is exactly what I want 
to do in setting up class methods that refer to the collection of all objects that are in use. 


In Gene2.pm, I want to initialize the count of objects to 0 when the program starts and then 
increment it by one each time a new object is created. By defining incr count as a closure, 
I can call it from within the new object constructor, ensuring that the variable $_ count will 
keep an accurate count of the number of objects that are created. 


3.7.2 Tracking Class Data from the Constructor Method 


In this second version of the class, I just have to make a small change to the constructor 
method, the subroutine new. 
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Here is the modified new method constructor: 


# The constructor for the class 
sub new { 





my ($class, %arg) = @ ; 
my Sself = bless { 
_ name => Sarg{name} || croak("Error: no name"), 
_ organism => Sarg{organism} || croak("Error: no organism"), 
_chromosome => Sarg{chromosome}|| "????", 
_pdbref => Sarg{pdbref} a. See, 


ig Sclass; 
$class->_incr_count( ); 
return Sself; 


} 


First, I create the object by blessing (and initializing) an anonymous hash, as before. This 
time, however, I'll save the object as the local variable $self. This allows me to add a call to 
the class method incr count in order to keep track of the total number of objects created. 
I'll then return the object $self from the subroutine. 


3.7.3 Accessor and Mutator Methods 


In the first version of Genel .pm, I printed the values stored in an object by accessing simple 
methods such as get_name. 


In this new version of Gene2.pm, I have the same specific methods for each attribute for 
which I may want to see the value. I also include mutators, which are subroutines that enable 
the user of the class to alter the values of attributes of an object. 


Here are the accessor and mutator methods for Gene2.pm: 


# Accessors, for reading the values of data in an object 


sub get_name { $_[0] -> {_name} } 
sub get_organism { $_[0] -> {_organism} } 
sub get_chromosome { $ [0] -> {_ chromosome} } 
sub get_pdbref { $_[0] -> {_pdbref} } 


# Mutators, for writing the values of object data 
sub set name { 

my ($self, Sname) = @ ; 

$self -> { name} = $name if $name; 








sub set organism { 
my ($self, Sorganism) = @ ; 
$self -> {_organism} = Sorganism if Sorganism; 


sub set chromosome { 
my ($self, Schromosome) = @ ; 
$self -> { chromosome} = $chromosome if $chromosome; 


sub set_pdbref { 

my ($self, Spdbref) = @ ; 

$self -> {_pdbref} = Spdbref if Spdbref; 
} 


The mutators collect two arguments. The first is the reference to the object, which as before, 
is passed automatically to the method when it is invoked (using the method set_name as an 
example): 

Sobj->set_name('hairy'); 

The second argument collected is then the first argument given to the call, in this case, setting 
the gene name to hairy. 
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The work of the subroutine is accomplished by the line: 
Sself -> { name} = $name if $name; 


It simply sets the internal name attribute to the supplied name (hairy in this example) if the 
argument $name is supplied. If it's not supplied, the subroutine does nothing. 


Again, you see that the internal representation of the attributes of the object are hidden from 
the class's user. Altering an object's attributes is done with methods; the class author is then 
free to alter the way in which the attributes are stored, without changing the Application 
Programming Interface (API), the interface of the class to the outside world. If you use this 
class, you don't have to change your code when a new version of the class is written. 


The test program testGene2 is similar to testGene1, with the addition of examples of the 
class mutators. 


[ 
Tea 

m 

LiB 


Page 111 


[Team LiB ] 


3.8 Gene3.pm: A Third Example of a Perl Class 


We've gone through two iterations building an OO class with the Genel.pm and Gene2.pm 
modules. Now, let's add a few more features and create the Gene3.pm module as a 
penultimate example for this introduction to OO programming in Perl. 


Here is the code for Gene3.pm and for the test program testGene3; also included is the 
output produced by running testGene3. Following the code will be a discussion of the new 
features of this third version of our example class. But I'll point out before you read on, that 
AUTOLOAD is a special name in Perl for a subroutine that will handle a call to any undefined 
subroutine in a class. (I'll give more details after you look at the code.) 


package Gene3; 
A third version of the Gene.pm module 


use strict; 

use warnings; 

our SAUTOLOAD; # before Perl 5.6.0 say "use vars 'SAUTOLOAD';" 
use Carp; 


# Class data and methods, that refer to the collection of all objects 
in the class, not just one specific object 








{ 


my $ count = 0; 
sub get count { 
$count; 


n 


ub incr count { 
++$ count; 





n 


ub _decr_ count { 
--$_ count; 








} 


# The constructor for the class 
sub new { 





my ($class, %arg) = @ ; 

my Sself = bless { 
_ name => Sarg{name} || croak("Error: no name"), 
_organism => Sarg{organism} || Groak("Eeror: no organism"), 
_chromosome => Sarg{chromosome}|| "????", 
_pdbref => Sarg{pdbref} Li eee 5 
_author => Sarg{author} [fill aos 
_date => Sarg{date} [i Meer e", 


hig Sclass; 
$class->_incr_count( ); 
return Sself; 


} 





# This takes the place of such accessor definitions as: 
# sub get_attribute { ... } 
# and of such mutator definitions as: 
# sub set_attribute { ... } 
sub AUTOLOAD { 
my ($self, S$newvalue) = @ ; 
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my (Soperation, Sattribute) = (SAUTOLOAD =~ /(get|set) (_\wt)$/) 


# Is this a legal method name? 
unless(Soperation && Sattribute) { 
croak "Method name SAUTOLOAD is not in the recognized form 
attribute\n"; 
} 
unless(exists Sself->{Sattribute}) { 


croak "No such attribute 'Sattribute' exists in the class " 


ref ($self); 
} 


# Turn off strict references to enable "magic" AUTOLOAD speedup 





no strict *refe'> 


# AUTOLOAD accessors 
if(Soperation eq 'get') { 
# define subroutine 
*{SAUTOLOAD} = sub { shift->{Sattribute} }; 


# AUTOLOAD mutators 
}elsif (Soperation eq 'set') { 
# define subroutine 
*{SAUTOLOAD} = sub { shift->{Sattribute} = shift; }; 


# set the new attribute value 
Sself->{Sattribute} = Snewvalue; 





# Turn strict references back on 
use strict 'refs'; 


# return the attribute value 
return $self->{S$attribute}; 
} 


, 


(get|set) 


, 


# When an object is no longer being used, this will be automatically called 


# and will adjust the count of existing objects 
sub DESTROY { 

my($self) = @ ; 

Sselft-> décr -count( )¢ 





} 


# Other methods. They do not fall into the same form as the majority handled 


by 
AUTOLOAD 
# This is an example of a method that is both accessor and mutator, 
on the 
# number of arguments provided to it. 
sub -ecltation { 
my ($self, Sauthor, Sdate) = @ ; 
$self->{_ author} = set_author(Sauthor) if Sauthor; 
Sself->{_ date} = set _date(S$date) if $date; 
return ($self->{ author}, $self->{_date}) 





1; 
3.8.1 Testing Gene3.pm 


Here is the test program testGene3 for the Gene3.pm class: 
#!/usr/bin/perl 


depending 
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Test the third version of the Gene module 


use strict; 
use warnings; 





Change this line to show the folder where you store Gene.pm 
use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 
use Gene3; 





print “Object 1i\n\n"; 


# Create first object 
my Sobj1l = Gene3->new ( 


name => "Aging", 
organism => "Homo sapiens", 
chromosome => "23", 

pdbref => "pdb9999.ent" 


)3 


# Print the attributes of the first object 
print Sobjl->get_name, "\n"; 

print Sobjl->get_organism, "\n"; 

print $Sobjl->get_chromosome, "\n"; 

print Sobjl->get_pdbref, "\n"; 


# Test AUTOLOAD failure: try uncommenting one or both of these lines 
#print Sobj1->get_exon, "\n"; 
#print Sobj1->getexon, "\n"; 
print “"\n\nObject 22\n\n"; 
# Create second object 
my Sobj2 = Gene3->new ( 
organism => "Homo sapiens", 
name => "Aging", 
\e 
# Print the attributes of the second object ... some will be unset 


print $obj.2=>get_nmame;, “\n"; 

print Sobj2->get_organism, "\n"; 
print $obj2->get_chromosome, "\n"; 
print Sobj2->get_pdbref, "\n"; 


# Reset some of the attributes of the second object 
Sobj2->set_name ("RapidAging") ; 
Sobj2->set_chromosome ("22q") ; 

S$obj2->set_pdbref ("pdf£9876.ref") ; 
S$obj2->set_author("D. Enay"); 

S$obj2->set_date ("February 9, 1952"); 





print "\a\n"? 

# Print the reset attributes of the second object 
print Sobj2->get_name, "\n"; 

print Sobj2->get_organism, "\n"; 

print S$obj2->get_chromosome, "\n"; 

print $obj2=>get_pdbref,; "\n"; 

print Seb ]2=Scitation, “\w"; 


# Use a class method to report on a statistic about all existing objects 





print "\nCoune. 16", Genes-=>get count, "\n\nl? 
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print "\n\nObject 3:\n\n"; 


# Create a third object: but this fails 








# because the "name" value is required (s Gene .pm) 
my Sobj3 = Gene3->new ( 

organism => "Homo sapiens", 

chromosome => 2, 

pdbref => "pdb9999.ent" 


)3 


# This line is not reached due to the fatal failure to 
# create the third object 
print "\nCoumt as", Genmes=>ger count, “\n\nany 


Finally, here is the output from running the test program testGene3: 


Object 1: 


Aging 

Homo sapiens 
23 
pdb9999.ent 


Object 2: 


Aging 

Homo sapiens 
2??? 

2222 


RapidAging 

Homo sapiens 

22g 

pdf9876.ref 

D. EnayFebruary 9, 1952 





Count is 2 


Object 3: 





Error: no name at testGene3 line 70 
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3.9 How AUTOLOAD Works 


The AUTOLOAD mechanism, built into the definition of Perl packages, is simple to use. If a 
subroutine named AUTOLOAD is declared within a package, it is called whenever an undefined 
subroutine is called within the package. AUTOLOAD is a special name, and must be capitalized 
as shown, because Perl is designed that way. Don't use the subroutine name AUTOLOAD (or 
DESTROY) for any other purpose, or you'll suffer unintended consequences. 





Without an AUTOLOAD subroutine defined in a package, an attempt to call some undefined 
subroutine simply produces an error when the program runs. But if an AUTOLOAD subroutine is 
defined, it is called instead and is passed the arguments of the undefined subroutine. At the 
same time, the SAUTOLOAD variable is set to the name of the undefined subroutine. 

Here's an example of a short Perl program that tries to call an undefined function: 
#!/usr/bin/perl 


use strict; 
use warnings; 


print "I started the program\n"; 
report protein function("one", "two"); 


print "I got to the end of the program\n"; 


It gives the following output: 


I started the program 
Undefined subroutine &main::report protein function called at jk.pl line 8. 


Here's what happens when an AUTOLOAD subroutine is defined in the package: 
#!/usr/bin/perl 
Wee Sieiery 


use warnings; 
use vars 'SAUTOLOAD'; 


print "I started the program\n"; 
report. protein function ("one", “two"); 
print "I got to the end of the program\n"; 


sub AUTOLOAD { 
print "AUTOLOAD is set to SAUTOLOAD\n"; 
print "with arguments ", "@ \n"; 

} 

It gives the following output: 


I started the program 

AUTOLOAD is set to main::report protein function 
with arguments one two 

I got to the end of the program 


3.9.1 Defining Global Variables 


Recall that when you start programs with such statements as: 


use strict; 
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you have to declare all variables as lexically scoped using my. However, there are times when 
your program needs to use global variables that aren't lexically scoped. To use AUTOLOAD, you 
need access to the predefined S$AUTOLOAD global variable. 


To enable access of the package global SAUTOLOAD, you must specifically exempt it from the 
use strict injunction. This can be accomplished with the use vars statement: 


use vars 'SAUTOLOAD'; 


Other globals can be declared in this way as well, but globals should be used sparingly, and 
preferably not at all. 


Newer versions of Perl (after Version 5.6.0) have a cleaner way to declare global variables 
even when use strict is in effect: 


our SAUTOLOAD; 


This makes the variable SAUTOLOAD a legal global within the scope in which it is declaredin 
Gene3.pm, the scope is the entire class. 


Without our SAUTOLOAD or use vars 'SAUTOLOAD', the program won't run; instead, it 
complains vociferously that: 


Global symbol "SAUTOLOAD" requires explicit package name 


3.9.2 AUTOLOAD Simplifies Writing Methods 


Having the AUTOLOAD mechanism available can greatly simplify the writing of class methods. 
Many classes require methods to examine and to change the values of attributes, as have the 
two previous versions Genel.pm and Gene2.pm. 


If an object has many attributes, you have to write an accessor method and a mutator 
method for each attribute. This is repetitive; it requires defining more methods every time the 
list of attributes changes, and, in general, it's hard to maintain such code. 


The new version Gene3.pm useS AUTOLOAD to automate the handling of methods for 
accessors and mutators. All you need do is write the one AUTOLOAD subroutine, and all these 
similar, basic methods are handled in the same fashion by the one bit of code. 


3.9.2.1 Bypassing use strict 


AUTOLOAD starts by fiddling with the use strict statement. Just as it requires the SAUTOLOAD 
global variable to be exempted from the use strict directive, so does the magic AUTOLOAD 

speedup (described in the next section) require an exemption from the use strict directive 
at a specific place within the AUTOLOAD subroutine. Thus, the statement: 


no strict.“ rere”; 


turns off the use strict where required. This enables the lines (to be explained later) such 
as: 


*{SAUTOLOAD} = sub { return $ [0]->{Sattribute} }; 


to bypass the otherwise desirable use strict instruction. 
3.9.2.2 AUTOLOAD arguments 


Recall that AUTOLOAD is automatically used when 1) it has been defined, and 2) an undefined 
subroutine is called. When this happens, AUTOLOAD is simply passed the arguments that would 
have gone to the undefined subroutine. 


For example, say you call an undefined method fold on an object Speptide: 
Speptide->fold(-style => 'prion') 


If you define an AUTOLOAD method in the class, it's called and passed the calling object or class 
name, as usual, plus the arguments -style => 'prion' you were trying to pass to the 
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nonexistent fold method. The global scalar variable SAUTOLOAD is also set to the name of the 
nonexistent fold method. 


The version of AUTOLOAD in Gene3.pm captures one written argument. So, of course, this 
AUTOLOAD actually captures two arguments: the class object automatically passed into the 
subroutine by arrow notation, which appears first, and the other arguments, if any. This line in 
the AUTOLOAD subroutine: 

my ($self, Snewvalue) = @ ; 


assigns the reference to the object to the new variable $self and the value to be set, if any, 
to the new variable $newvalue. 


3.9.2.3 Using naming conventions to write code: get_ and set_ 


The various versions of the Gene module have named attributes with beginning underscores, 
for example, _name for the gene name. The accessors and mutators for attributes have been 
assigned names that prepend get and set to the beginning of the attribute name, for 
example, get_name and set_name. 





In Gene3.pm, the AUTOLOAD subroutine elevates this convention to an enforced discipline, by 
recognizing only method names and attribute names that conform to this convention. It first 
examines the name of the called subroutine as stored in the SAUTOLOAD global variable, 
checks if the subroutine name is in the expected form, and if so, extracts the attribute name 
from the subroutine name with a regular expression. The AUTOLOAD subroutine then checks 
that the requested attribute exists, and fetches or sets the value of that attribute. 


The first part of the AUTOLOAD subroutine does some checking to see if the subroutine name 
is in the expected form, and if so, it extracts the attribute name, and the requested operation 
(get or set). This first test: 


my (Soperation, Sattribute) = (SAUTOLOAD =~ /(get|set) (_\wt)$/); 


# Is this a legal method name? 
unless(Soperation && Sattribute) { 

croak "Method name SAUTOLOAD is not in the recognized form 
(G6t| Set) attribute a"; 
} 
unless(exists S$self->{Sattribute}) { 

croak "No such attribute 'Sattribute' exists in the class ", ref(S$self); 


} 
uses a regular expression to see if the SAUTOLOAD variable is storing a method name that 


ends with an attribute name (complete with leading underscore) that is defined for objects of 
this class if it begins with get or set as the desired operation. The regular expression: 





(get|set) (_\w+)$ 


looks for a name that, after get or set, is composed of an underscore followed by one or 
more legal word characters (as described in the perlre manpage on regular expressions): 


\wt 


Here, the underscore matches an underscore, and the \w matches any legal word character, 
and the + matches one or more such word characters. These are remembered and captured 
in the Soperation and Sattribute variables by surrounding with parentheses the parts of 
the regular expression that match the operation and the attribute name: 


(get|set) (_\wt) 


This attribute name is assigned to the variable $attribute (for obvious mnemonic reasons) 
to use in the rest of the subroutine. Similarly, the operation get or set is assigned to the 
Soperation variable. 


The second part of the test checks to see if such an attribute name exists in the hash that 
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represents the class object: 


unless(exists Sself->{Sattribute}) { 
croak "No such attribute 'Sattribute' exists in the class ", ref($self); 


} 


The exists Perl command checks to see if a hash key exists; the value for the key may not 
have been set, but the key must exist. $self is the reference to the class object, so the 
following: 


exists Sself->{Sattribute} 


checks to see if any such attribute actually exists in the object. 


If the method name passed to AUTOLOAD begins with get or set, ends with a name including a 
leading underscore, and if that name is an existing key in the hash that is the class object, the 
tests will succeed. If they fail, the program will croak at this point. 


3.9.2.4 AUTOLOAD accessors 


The next bit of AUTOLOAD code handles the calls to class accessors: 


# AUTOLOAD accessors 
if (Soperation eq 'get') { 

# define subroutine 

*{SAUTOLOAD} = sub { shift->{Sattribute} }; 
} 


The code first determines that a get accessor was wanted. Then the undefined accessor 
method (whose name has been saved in the variable SAUTOLOAD) is defined. The subroutine 
definition is placed in the program's symbol table with * {SAUTOLOAD}. The new subroutine 
gets the object from the arguments by the call to shift. The object is a hash, and the value 
in the hash for the attribute is returned from the subroutine. So this method is a simple 
accessor, that, given an attribute name, returns the value. This accessor isn't actually used 
here; it's just defined in the symbol table. 


3.9.2.5 AUTOLOAD mutators 


The next bit of AUTOLOAD code handles the calls to class mutators: 


# AUTOLOAD mutators 
}elsif (Soperation eq 'set') { 
# define subroutine 
*{SAUTOLOAD} = sub { shift->{Sattribute} = shift; }; 


ribute value 
} = Snewvalue; 


# set the new att 
Sself->{Sattribut 








} 


Here, after determining that a set mutator method was called, the undefined mutator 
method (whose name has been saved in the variable SAUTOLOAD) is defined. The new 
subroutine gets the object from the arguments by the first call to shift and sets the attribute 
of the object to the new value, which it gets from the arguments by the second call to shift. 
After defining the new mutator method, the code actually sets the attribute key to the 
Snewvalue that was passed in as an argument. 


Finally, the AUTOLOAD program, after defining the new accessor or mutator method, as the 
case may be, and setting the new value of the attribute if a mutator method has been 
defined, returns the value of the attribute: 


# return the attribute value 
return $self->{Sattribute}; 


So the AUTOLOAD method both defines the accessor or mutator methods and behaves just 
like the defined accessor or mutator method by returning the attribute value (if it's a mutator, 
it first resets the attribute). 
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3.9.2.6 AUTOLOAD speedup 
The so-called "magic" lines in the accessor and mutator code that I've referred to: 
*{SAUTOLOAD} = sub { shift->{Sattribute} }; 


and: 
*{SAUTOLOAD} = sub { shift->{Sattribute} = shift; }; 





are there purely in order to speed up the code. 


AUTOLOAD performs its tasks a bit on the slow side. For a large program that does a lot of 
getting and setting of attributes, the slowdown is noticeable. What is saved in programming 
time by having AUTOLOAD handle all these accessors and mutators, is lost in runtime. The 
slowdown comes from the program having to figure out what is wanted by the undefined 
methods, the use of regular expressions to parse the names of the methods, etc. 


The magic lines actually define the new methods in the symbol table, on the fly, when they 
don't already exist. (The * gives access to the symbol table, but I'll omit the details of how 
the symbol table is defined and manipulated and stick to practicalities here.) After they are 
called once, and the AUTOLOAD overhead is incurred, the methods are thenceforth defined in 
the symbol table of the running program. So, for instance, the second time that the accessor 
method get_name is called, the program finds the definition in the symbol table, and AUTOLOAD 
isn't called. This results in a considerable speedup for the program overall. 


I'll not delve too deeply into how this works. Briefly, the SAUTOLOAD variable contains the 
name of the desired method call, say, get_name. The star * in * {SAUTOLOAD} is a reference to 
the definition of that method call in the symbol table. This symbol table reference is assigned 
the part of the expression to the right of the assignment sign (=) that's an (anonymous) 
subroutine definition. 


The symbol table is thus manipulated directly from your program, and the missing accessor 
and mutator definitions are installed in the symbol table the first time AUTOLOAD is called to 
handle them. After this first call that invokes AUTOLOAD, the program can find the method 
definitions in the symbol table and uses those definitions, bypassing AUTOLOAD. For more 
details, see O'Reilly's Programming Perl. 
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3.10 Cleaning Up Unused Objects with DESTROY 


When a running program no longer needs a portion of computer memory, what happens to 
it? How is the program's memory managed? Various possibilities exist, and different 
languages handle the problem in different ways. For instance, the designer of the language 
can just leave the memory as it is, unused, and go on and use other memory for other tasks. 
No clean up is strictly necessary. 


However, this might, and does, cause problems with certain kinds of programs. Some 
programs read in large amounts of data into their memory, perhaps extract some statistics 
on the data, and then go on to the next large chunk of data to repeat the same operation. A 
computer's memory is finite; for a program that runs a long time and examines a continuous 
source of data (say, for instance, the data generated by your sequencing facility), it will at 
some point use all available main memory. 


It is necessary to consider how to clean up memory that is no longer used, so it can be 
reused by the program. This is sometimes called the garbage collection problem. 
Consideration of this problem has resulted in many approaches and a large amount of 
literature, which won't be discussed here. 


However, sometimes there are practical considerations. In the class module Gene.pm, I'm 
keeping count of all objects that are created by the running program. In Perl, when a variable 
is no longer used, its memory is automatically cleaned up. One such instance is when a 
variable goes out of scope. For instance, in the following code fragment, the variable $i goes 
out of scope after the if block, and its memory is cleaned up, making it available to the rest 
of the program: 
a6 (iL): <4 

my $i = 'ACCGGCCGGCCGGTTAATGCATAATC'; 

determine function ($i); 


} 


# Si has gone out of scope here 


This problem actually affects the Gene.pm module. Say you create a new object, and as the 
program continues, the object goes out of scope. For instance, if the object was created 
within a block, and the program leaves the block, the object is then out of scope. Perl will 
remove the part of memory that held the object, and all will be well... except that the global 
count of the number of objects will now be off by one! 


What is needed is a way to automatically call a bit of code to adjust the global count 
whenever an object goes out of scope. Perl provides such a mechanism with the DESTROY 
subroutine. Perl calls the DESTROY method 1) if you've defined a method with that name in 
your class, and 2) a class object (a reference blessed with the name of the class) goes out 
of scope. It does so automatically, just as AUTOLOAD is automatically called if you attempt to 
call a method that doesn't exist on a class object. 








In our program, the only thing keeping track of when an object goes out of scope and is 
garbage collected by Perl is the global count of existing objects. This simple DESTROY 
subroutine will thus suffice: 
sub DESTROY { 

my ($self) = @ ; 

$self->_decr_count( ); 








} 
Let's see if it works. Here's a test program, testGeneGc (GC for garbage collection): 
#!/usr/bin/perl 
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Test the garbage collection of the Gene.pm module 


use strict; 
use warnings; 


# Change this line to show the folder where you store Gene. 
use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 
use Gene; 








print "\nCount is ", Gene-Sget_count, "\n\n"; 
rec) f 
# Create first object 
my Sobj1l = Gene->new ( 
name => "Genel", 
organism => "Homo sapiens", 
); 
prank "\nCount as: Geme=>get count, “\n\n"; 
# Create second object 
my Sobj2 = Gene->new ( 
name => "Gene2", 
organism => "Homo sapiens", 
); 
print." \nCount is", Gene=>get count, "\n\n"; 
# Create a third object 
my Sobj3 = Gene->new ( 
name => "Gene3", 
organism => "Homo sapiens", 


Me 


print "\nCount is ", Gene->get_count, "\n\n"; 


} 


print "\nCount is ", Gene->get_count, "\n\n"; 


pm 


This produces an output that shows that once the three objects created in the scope of the 


if block go out of scope, the count is properly set back to zero: 
is 
is 
is 
is 
is 


Count 
Count 
Count 
Count 
Count 


@ @ NM © 





As a further test, let's try taking the definition of the DESTROY subroutin 
Now, try the test program testGeneGc to get the following output: 





Count is 0 
Count is 1 
Count is 2 
Count. AS 3 
Count, aS 3 


As you see, the last line still has a count of three Gene objects, when th 
still within scope. To properly keep class-wide data on all objects, the D! 
sometimes a necessity. 





For more details on D! 


e out of Gene.pm. 


ere are in reality none 
ESTROY subroutine is 





ESTROY, including discussions of how to clean up more complicated data 
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structures, see Section 3.14. 
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3.11 Gene.pm: A Fourth Example of a Perl Class 


We've now come to the fourth and final version of the Gene class, Gene.pm. This final version 
adds a few more bells and whistles to make the code more reliable and useful. You'll see how 
to define the class attributes in such a way as to specify the operations that are permitted on 
them, thus enforcing more discipline in how the class can be used. You'll also see how to 
initialize an object with class defaults or clone an already existing object. You'll see the 
standard and simple way in which the documentation for a class can be incorporated into the 
.pm file. This will conclude my introduction to OO Perl programming (but check out the 
exercises at the end of the chapter and see later chapters of this book for more ideas). 


3.11.1 Building Gene.pm 


Here then is the code for Gene. pm. Again, I recommend that you take the time to read this 
code and compare it to the previous version, Gene3.pm, before continuing with the discussion 
that follows: 


package Gene; 


n 
A fourth and final version of the Gene.pm class 


use strict; 

use warnings; 

our SAUTOLOAD; # before Perl 5.6.0 say "use vars 'SAUTOLOAD';" 
use Carp; 





Class data and methods 
{ 





# A list of all attributes with default values and read/write/required 
properties 
my % attribute properties = ( 
_name => [ '????', "read. required'], 
_organism => [ '2?2???', "read.required'], 
_chromosome => [ '????', 'read.write'], 
_pdbref => [ '?2???', 'read.write'], 
_author => Pe Rots 'read.write'], 
date => [ T2222", 'read.write'], 





he 


# Global variable to keep count of existing objects 
my $ count = 0; 





# Return a list of all attributes 
sub _all_ attributes { 


keys % attribute properties; 


} 


# Check if a given property is set for a given attribute 
sub permissions { 
my(Sself, Sattribute, $permissions) = @ ; 


$ attribute properties {Sattribute}[1] =~ /Spermissions/; 





} 


# Return the default value for a given attribute 
sub attribute default { 
my(Sself, Sattribute) = @ ; 

$_ attribute properties {Sattribute} [0]; 





Page 124 


} 


# Manage the count of existing objects 
sub get count { 
$count; 


a 


ub incr count { 
++$ count; 





n 








ub _decr_ count { 
—=§_ count; 


} 


# The constructor method 


# Called from class, e.g. $obj = Gene->new( ); 
sub new { 
my ($class, %arg) = @ ; 


# Create a new object 
my Sself = bless { }, Sclass; 











foreach my Sattribute ($self->_all_attributes( )) { 
E.g. attribute = " name", argument = "name" 
my (Sargument) = (Sattribute =~ /* _(.*)/); 


# If explicitly given 
if (exists Sarg{Sargument}) { 
Sself->{Sattribute} = Sarg{Sargument}; 
If not given, but required 
}elsif (S$self->_permissions (S$attribute, 'required')) { 
croak("No Sargument attribute as required"); 
Set to the default 
jelse{ 
S$self->{Sattribute} = $self->_attribute default (Sattribute) ; 








} 
} 
$class->_incr_count( ); 
return Sself; 


The clone method 

All attributes will be copied from the calling object, unless 
specifically overridden 

Called from an exisiting object, e.g. $cloned_obj = Sobj1l->clone( ); 
sub clone { 

my ($caller, %arg) = @ ; 

# Extract the class name from the calling object 

my Sclass = ref(Scaller); 

# Create a new object 

my Sself = bless { }, Sclass; 


He oH He HE 








foreach my Sattribute ($self->_all_attributes( )) { 
# E.g. attribute = " name", argument = "name" 
my(Sargument) = (Sattribute =~ /* (.*)/); 
# If explicitly given 
if (exists Sarg{Sargument}) { 
Sself->{Sattribute} = Sarg{Sargument}; 
# Otherwise copy attribute of new object from the calling object 
}else{ 
Sself->{Sattribute} = Scaller->{Sattribute}; 

















} 
} 
$self->_incr count( ); 
return Sself; 
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} 


it 
it 
i 
it 


Sub Gét attribute. {).2.4 
and of such mutator definitions as: 
sub set_attrabute { 2.« } 
sub AUTOLOAD { 
my ($self, Snewvalue) = @ ; 
my (Soperation, Sattribute) = (SAUTOLOAD =~ /(get|set) (_\wt)$/); 
# Is this a legal method name? 


This takes the place of such accessor definitions as: 





c 


nless(Soperation && Sattribute) { 


croak "Method name SAUTOLOAD is not in the recognized form (get|set) _ 


attribute\n"; 


} 


unless(exists S$self->{Sattribute}) { 


croak "No such attribute Sattribute exists in the class ", 


ref ($self); 


} 


# Turn off strict references to enable "magic" AUTOLOAD speedup 





no strict 'refs'; 


# AUTOLOAD accessors 
if(Soperation eq 'get') { 


# Complain if you can't get the attribute 
unless ($self-> permissions (S$attribute, 'read')) { 
croak "Sattribute does not have read permission"; 


} 


# Install this accessor definition in the symbol table 
*{SAUTOLOAD} = sub { 
my ($self) = @ ; 
unless (S$self->_permissions(Sattribute, 'read')) { 
croak "Sattribute does not have read permission"; 
} 
Sself->{Sattribute}; 
}; 


# AUTOLOAD mutators 
}elsif (Soperation eq 'set') { 


} 


# Complain if you can't set the attribute 
unless (S$self-> permissions (Sattribute, 'write')) { 
croak "Sattribute does not have write permission"; 


} 


# Set the attribute value 
Sself->{Sattribute} = $newvalue; 





# Install this mutator definition in the symbol table 
*{SAUTOLOAD} = sub { 
my ($self, Snewvalue) = @ ; 
unless (S$self-> permissions (Sattribute, 'write')) { 


croak "Sattribute does not have write permission"; 


} 
Sself->{Sattribute} = S$newvalue; 





hy 


# Turn strict references back on 
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use strict 'refs'; 


# Return the attribute value 
return $self->{S$attribute}; 
} 


# When an object is no longer being used, 


this will be automatically called 


# and will adjust the count of existing objects 


sub DESTROY { 
my($self) = @; 
Sself-> decr count( )7 





# Other methods. 





They do not fall into the same form as the majority handled 


by 
AUTOLOAD 
sub citation { 
my ($self, Sauthor, Sdate) = @ ; 
$self->{_ author} = set_author(Sauthor) if Sauthor; 
S$self->{_ date} = set _date(S$date) if $date; 
return ($self->{ author}, $self->{_date}) 
} 
Le 
=headl Gene 
Gene: objects for Genes with a minimum set of attributes 


=headl Synopsis 
use Gene; 


my Sgenel = Gene->new ( 





name => 'biggene', 
organism => 'Mus musculus’, 
chromosome => '2p', 

pdbref => 'pdb5775.ent', 
author => 'L.G.Jeho', 

date => 'August 23, 1989', 


i 





print "Gene name is ", 
print "Gene organism is ", 
print "Gene chromosome is ", 
print "Gene pdbref is ", 
print "Gene author is ", 
print "Gene date is ", 


S$genel->get_name( ); 
$genel->get_organism( ); 
Sgenel->get_chromosome( ); 
Sgenel->get_pdbref( ); 
Sgenel->get_author( ); 
S$genel->get_dat 


( dF 





Sclone = Sgenel->clone (name => 


$genel-> set chromosome ('2q'); 
$genel-> set_pdbref('pdb7557.ent'); 
$genel-> set _author('G.Mendel'); 
Sgenel-> set _date('May 25, 1865'); 





Sclone->citation('T.Morgan', 
print "Clone citation is ", 
=headl AUTHOR 


A kind reader 


"October 3, 


"biggeneclone'); 


1912")? 


Sclone->citation; 
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=headl COPYRIGHT 

Copyright (c) 2003, We Own Gene, Inc. 

=cut 

3.11.2 Defining Attributes and Their Behaviors 

This fourth version of Gene.pm does some additional things with the available attributes: 


e It collects them in their own hash, 3 attribute properties. This makes it easier to 
modify the class; you only have to add or delete attributes to this one hash, and the 
rest of the code will behave accordingly. 


e It enables you to specify default values for each attribute. In the Gene.pm class, I just 
specify the string ???? as the default for each attribute, but any values could be 
specified. 


e This attribute hash specifies, for each attribute, whether it is permitted to read or write 
it, and if it is required to have a nondefault value provided. 


Here is the hash that supports all this: 


# A list of all attributes with default values and read/write/required 








properties 
my % attribute properties = ( 
_name => [| T9222", "read. required'], 
_ organism => [ Tee22" , "read. required'], 
chromosome -=>. [| '2??2"; 'read.write'], 
_pdbref => [ T2222", 'read.write'], 
_author => [ '?2???', 'read.write'], 
_date => [ '????', ‘'read.write'], 


i 


Why have the read/write/required properties been specified? It's because sometimes 
overwriting an attribute may get you into deep water; for instance, if you have a unique ID 
number assigned to each object you create, it may be a bad idea to allow the user of the 
class to overwrite that ID number. Restricting the access to read-only forces the user of the 
class to destroy an unwanted object and create a new one with a new ID. It depends on the 
application you're writing, but in general, the ability to enforce read/write discipline on your 
attributes can help you create safer code. 


The required property ensures that the user gives an attribute a value when the object is 
created. I've already discussed why that is useful in earlier versions of the class; here, I'm just 
implementing it in a slightly different way. 


This way of specifying properties can easily be expanded. For instance, if you want to add a 
property no_overwrite that prevents overwriting a previously set (nondefault) value, just 
add such a string to this hash and alter the code of the mutator method accordingly. 


Now that we've got a fair amount of information about the attributes collected in a separate 
data structure, we need a few helper methods to access that information. 


First, you need a method that simply returns a list of all the attributes: 


# Return a list of all attributes 
sub all attributes: { 
keys S attribute properties; 


} 


Next, you'll want a way to check, for any given attribute and property, if that property is set 
for that attribute. The return value is the value of the last statement in the subroutine, which 
is true or false depending on whether or not the property $permissions is set for the given 
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attribute: 


# Check if a given property is set for a given attribute 
sub permissions { 
my(S$self, Sattribute, $permissions) = @ ; 
$_ attribute properties {Sattribute}[1] =~ /Spermissions/; 





} 


Finally, to set attribute values, you'll want to report on the default value for any given 
attribute. This returns the value of the last statement in the subroutine, which is the default 
value for the given attribute (this is a hash of arrays, and the code is returning the first 
element of the array stored for that attribute, which contains the default value): 
# Return the default value for a given attribute 
sub attribute default { 

7 my ($self, Sattribute) = @ ; 

$ attribute properties{$attribute} [0]; 
} 


3.11.3 Initializing the Attributes of a New Object 


This fourth and final version of Gene.pm has some alterations to the new constructor method. 
These alterations incorporate tests and actions relating to the new information being specified 
about the attributes, namely, their default values and their various properties. 


I've also added an entirely new constructor method, clone. Recall that the new constructor 
method is called as a class method (e.g., Gene->new( )) and uses default values for every 
attribute not specified when called. It is often useful to create a new object by copying an old 
object and just changing some of its values. clone gives this capability. It is called as an 
object method (e.g., Sgeneobject->clone( )). 


Let's examine the changes that were made to the new constructor; then we'll look at the 
clone constructor. 


3.11.3.1 The newer new constructor 


Here is the new version of the code for the new constructor: 


# The constructor method 


# Called from class, e.g. S$obj = Gene->new( ); 
sub new { 
my ($class, %arg) = @ ; 





# Create a new object 
my Sself = bless { }, Sclass; 








foreach my Sattribute ($self->_all_attributes( )) { 
E.g. attribute = "name", argument = "name" 
my(Sargument) = (Sattribute =~ /* (.*)/); 


If explicitly given 
if (exists Sarg{Sargument}) { 
Sself->{Sattribute} = Sarg{Sargument}; 
If not given, but required 
}elsif (S$self->_permissions (S$attribute, 'required')) { 
croak("No Sargument attribute as required"); 
# Set to the default 
jelse{ 
$self->{Sattribute} = $self->_attribute default (Sattribute) ; 








} 
} 
$class->_incr_count( ); 
return Sself; 


} 
Notice that we start by blessing an empty anonymous hash: bless { }, and then setting the 


Page 129 


values of the attributes. 


These attribute values are set one by one, looping over their list given by the new helper 
method all attributes. Recall that the attribute names start with an underscore, which 
indicates they are private to the class code and not available to the user of the class. Each 
attribute is associated with an argument that has the same name without the leading 
underscore. 


The logic of attribute initialization is three part. If an argument and value for an attribute is 
given, the attribute is set to that value. If no argument/value is given, but a value is required 
according to the properties specified for that attribute, the program croaks. Finally, if no 
argument is given and the attribute isn't required, the attribute is set to the default value 
specified for that attribute. 


As before, at the end of the new constructor, the count of objects is increased, and the new 
object is returned. 


3.11.3.2 The clone constructor 


The clone constructor is very similar to the new constructor. In fact, the two subroutines 
could be combined into one without much trouble. (See the chapter exercises.) However, it 
makes sense to separate them, especially since it makes it clearer what's happening in the 
code that uses these subroutines. Besides, you just have to figure that the special ability to 
clone objects will come in handy in bioinformatics! 


Here is the code for the clone constructor: 


# The clone method 
# All attributes will be copied from the calling object, unless 
# specifically overridden 





# Called from an exisiting object, e.g. $cloned_obj = Sobj1l->clone( ); 
sub clone { 

my ($caller, %arg) = @ ; 

# Extract the class name from the calling object 

my Sclass = ref (Scaller); 





# Create a new object 
my Sself = bless { }, Sclass; 





foreach my Sattribute ($self->_all_attributes( )) { 
# E.g. attribute = " name", argument = "name" 
my (Sargument) = (S$attribute =~ /* (.*)/); 


# If explicitly given 
if (exists Sarg{Sargument}) { 
Sself->{Sattribute} = Sarg{Sargument}; 
# Otherwise copy attribute of new object from the calling object 
jelse{ 
Sself->{Sattribute} = Scaller->{Sattribute}; 











} 

} 

Sself-> iner count( )7 

return Sself; 
} 
Notice, first of all, that this method is called from an object, in contrast to the new 
constructor, which is called from the class. That is, to create a new object, you say 
something like: 


Snewobject = Myclass->new( ); 


As usual, the class Myclass is named explicitly when calling the new constructor. 


On the other hand, to clone an existing object, you say something like: 
Sclonedobject = Snewobject->clone( ); 
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in which the clone constructor is called from an already existing object, in this case, the 
object Snewobject. 


Now, in the code for the clone method, the class name must be extracted from the caller by 
the ref ($caller) code because the caller is an object, not a class. 


Next, as in the new constructor, an empty anonymous hash is blessed as an object in the 
class, and then each attribute is considered in turn in a foreach loop. 


Now, the argument name associated with the attribute name is extracted. Here, a simpler 
two-stage test is made. As before, if the argument is specified, the attribute is set as 
requested. If not, the attribute is set to the value it had in the calling object. Finally, the count 
of objects is incremented, and the new object is returned. 


These two constructors give you some flexibility in how new objects are created and 
initialized in the Gene class. This flexibility may prove convenient and useful for you. 


3.11.4 Permissions 


The code to AUTOLOAD has been augmented with checks for appropriate permissions for the 
various attributes. The part of the code that handles the get_ accessor methods now checks 
to see if the read flag is set in the attribute hash via the permissions class method. Notice 
the code that installs the definition of an accessor into the symbol table has also been 
modified to accommodate this additional test: 


# AUTOLOAD accessors 
if (SAUTOLOAD =~ /.*::get_\wt/) { 
# Install this accessor definition in the symbol table 
*{SAUTOLOAD} = sub { 
my ($self) = @ ; 
unless (S$self-> permissions (Sattribute, 'read')) { 
croak "Sattribute does not have read permission"; 
} 
Sself->{Sattribute}; 
}; 
# Return the attribute value 
unless (S$self-> permissions (Sattribute, 'read')) { 
croak "Sattribute does not have read permission"; 





} 
return $self->{Sattribute}; 


} 


Similarly, the part of AUTOLOAD that defines mutator methods for setting attribute values now 
checks for write permissions in a similar fashion. 


3.11.5 Gene.pm Test Program and Output 


Here is a test program testGene that exercises some of the new features of Gene.pm, 
followed by its output. It's worthwhile to take the time to read the testGene program, 
looking back at the class module Gene. pm for the definitions of the objects and methods and 
seeing what kind of output the test program creates. Also, see the exercises for suggestions 
on how to further modify and extend the capabilities of Gene. pm. 


#!/usr/bin/perl 


nm 
# Test the fourth and final version of the Gene module 


use strict; 
use warnings; 
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# Change this line to show the folder where you store Gene.pm 
use lib "/home/tisdall/MasteringPerlBio/development/lib"; 





use Gene; 
print “Object Li\n\n"; 


# Create first object 


my Sobjl = Gene->new ( 
name => "Aging", 
organism => "Homo sapiens", 
chromosome => O38", 
pdbref => "pdb9999.ent" 


); 


# Print the attributes of the first object 
print Sobjl->get_name, "\n"; 

print Sobjl->get_organism, "\n"; 

print Sobjl->get_chromosome, "\n"; 

print Sobjl->get_pdbref, "\n"; 


# Test AUTOLOAD failure: try uncommenting one or both of these 


#print Sobj1->get_exon, "\n"; 
#print Sobj1->getexon, "\n"; 


print "\n\nObject 2:\n\n"; 


# Create second object 

my Sobj2 = Gene->new ( 
organism => "Homo sapiens", 
name => "Aging", 

\; 


# Print the attributes of the second object 
print Sobj2->get_name, "\n"; 

print Sobj2->get_organism, "\n"; 

print S$obj2->get_chromosome, "\n"; 

print $obj2=->get pdbref, "\n"; 


some will be unset 


# Reset some of the attributes of the second object 





# set name will cause an error 
#$obj2->set_name ("RapidAging") ; 
Sobj2->set_chromosome ("22q") ; 
Sobj2->set_pdbref ("pdf£9876.ref") ; 
Sobj2->set_author("D. Enay"); 
Sobj2->set_date ("February 9, 1952"); 





print “\n\a"? 

# Print the reset attributes of the second object 
print Sobj2->get_name, "\n"; 

print Sobj2->get_organism, "\n"; 

print Sobj2->get_chromosome, "\n"; 

print Sobj2->get_pdbref, "\n"; 

print Sobj2=>Scitation, "\n"; 


# Use a class method to report on a statistic about all existing objects 


print "\nCount. is. ", Gene=>get count; “\n\n"; 





print "Object 3: a clone of object 2\n\n"; 


# Clone an object 

my Sobj3 = S$obj2->clone ( 
name => "screw", 
organism => "C.elegans", 


lines 
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author => "TL. Turn", 
)e 





# Print the attributes of the cloned object 





print $obj3->get_name, "\n"; 

print S$obj3->get_organism, "\n"; 

print $obj3->get_chromosome, "\n"; 

print S$obj3->get_pdbref, "\n"; 

Print Sob ]3=>ertation, "“\n"s 

print "“\nCount is ", Gene->get count, “\o\n"; 
print "\n\nObject. 4: \n\n"; 


# Create a fourth object: but this fails 








# because the "name" value is required (s Gene.pm) 
my Sobj4 = Gene->new ( 

organism => "Homo sapiens", 

chromosome => N23", 

pdbref => "pdb9999.ent" 


)3 


# This line is not reached due to the fatal failure to 
# create the fourth object 
print "\nCount is ", Gene->get_count, "\n\n"; 


Here is the output from running the preceding program: 
Object 1: 


Aging 

Homo sapiens 
23 
pdb9999.ent 


Object 2: 


Aging 

Homo sapiens 
2??? 

2??? 


Aging 

Homo sapiens 

22g 

pdf9876.ref 

D. EnayFebruary 9, 1952 





Count is 2 

Object 3: a clone of object 2 
screw 

C.elegans 

229 

pdf9876.ref 

I.TurnFebruary 9, 1952 


Count is 3 


Object 4: 


No name attribute as required at testGene line 89 





Page 133 


i 


i 


aa 


4 PREVIOUS 
NEXT > 


Page 134 


[Team LiB J 


3.12 How to Document a Perl Class with POD 


An essential part of programming is documentation. Comments in the code are an important 
part of documenting code for those who have to read it or modify it in the future. 


Equally as important is documentation for those who have to use the code. A short, accurate, 
practical guide to using a Perl class is absolutely necessary in order for the class to be 
generally useful. 


Perl uses a language called POD (plain old documentation) to put documentation right in the 
code. The fourth and final version of Gene. pm has POD documentation embedded in it. 


To gain access to the documentation, you merely have to type: 
perldoc Gene.pm 


in the same directory in which the Gene.pm lives. (For other options, see the perlpod 
manpage on the Web, or type perldoc perlpod.) 


Given that this book contains copious amounts of explanation of the code, I've kept the POD 
documentation to a minimum. The POD language is simple; the best way to use it to write 
good documentation is to copy and modify the documentation style that's used by some 
other well-written module. You will almost always want to give a bit more information than 
the example shown here; try examining the documentation for some Perl modules on your 
computer, for example, perldoc CGI or, if it's installed, perldoc Bioperl. 


The Perl interpreter will ignore everything from a line beginning: 

=headl 

up to a line beginning: 

=Cue 

so you can embed your POD documentation in with your Perl code without difficulty. 

It's also worth pointing out that many filters exist that will take your .pm file with its 
embedded POD documentation and produce versions of the documentation in HTML, LaTEX, 
plain text, nroff, or other formats. 

Here's the output you get from typing perldoc Gene.pm: 

Gene (3) User Contributed Perl Documentation Gene (3) 


Gene 
Gene: objects for Genes with a minimum set of attributes 


Synopsis 
use Gene; 


my Sgenel = Gene->new ( 











name => 'biggene', 

organism => 'Mus musculus’, 

chromosome => '2p', 

pdbref => 'pdb5775.ent', 

author => 'L.G.Jeho', 

date => 'August 23, 1989', 
\e 
print "Gene name is ", S$genel->get_name( ); 
print "Gene organism is ", $genel->get_organism( ); 
print "Gene chromosome is ", Sgenel->get_chromosome( ); 
print "Gene pdbref is ", S$genel->get_pdbref( ); 
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print "Gene author is ", S$genel->get_author( ); 
print "Gene date is ", S$genel->get_date( ); 





Sclone = S$genel->clone(name => 'biggeneclone'); 


S$genel-> set _chromosome('2q'); 
S$genel-> set_pdbref ('pdb7557.ent'); 
S$genel-> set_author('G.Mendel'); 
S$genel-> set_date('May 25, 1865'); 





Sclone=>citation ("T.Morgan', "October 3, 1912'); 


print "Clone citation is ", $clone->citation; 


AUTHOR 
A kind reader 
COPYRIGHT 
Copyright (c) 2003, We Own Gene, Inc. 
2002-11-25 perl v5.6.1 Gene (3) 
Team LiB 
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3.13 Additional Topics 


Included in this section are a few more topics you may find useful. 
3.13.1 Using Class::Struct to Define Classes 


The kind of simple OO class that I've developed in this chapter has proved so useful that 
some clever folks have written a Perl module Class: :Struct that automates the 
construction of classes of this type. 


It's worth examining Class: :Struct because it can be a great timesaver for some situations. 


It's been used to create classes for many widely used modules. Type: 
perldoc Class: :Struct 


to get the whole story. 
3.13.2 Class Inheritance 


An important part of OO programming deals with the use of one class to help define another. 
For instance, you may have a class Protein that defines attributes common to all proteins. 
You can then use the Protein class to define a new class ZincFingers, which perhaps would 
have all the attributes of the Protein class plus some additional attributes relevant to the 
study of zinc fingers. 


You'll see the use of class inheritance in the next chapter. 
3.13.3 Bioperl 


Bioperl is a collection of modules of intense interest to the Perl bioinformatics programmer, 
written mostly in OO style. I'll take a look at the Bioperl software in Chapter 9. 


Team LiB 
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3.14 Resources 


A vast number of techniques are used in OO software development in Perl; many more than I 
have space to explore in this book. For more details than can fit into this book, I recommend 
these sources for more details on OO programming in Perl. 


Object Oriented Perl by Damian Conway (Manning Publishers). This is an excellent 
book and is useful for beginners to advanced. It even includes a few bioinformatics 
examples! My introduction to OO Perl has drawn gratefully on Conway's book. I urge 
readers who will be doing further OO Perl programming to get a copy. Some material 
is slightly dated; for example, the material on pseudohashes should be skipped. 


The perlobj page from the Perl documentation. 


The perlboot tutorial page from the Perl documentation is a beginning introduction to 
Perl objects. 


The perltoot tutorial page from the Perl documentation is a more detailed 
introduction to Perl objects. 





The perltootc tutorial page from the Perl documentation also includes more 
information on class methods. 


The perlbot tutorial page from the Perl documentation is a bag of tricks for Perl OO 
programming. 


Some books already mentioned in earlier chapters have extensive information about 
Perl OO programming, such as Programming Perl, Perl Cookbook, and Advanced Perl 
Programming. 


4 PREVIOUS || NEXT > 


[ Team LiB ] 


Page 138 


[Team LiB J 


3.15 Exercises 


Exercise 3.1 


Write brief descriptions of the main features of declarative programming, OO 
programming, logic programming, and functional programming. (See Section 3.14.) 


Exercise 3.2 


Give an example of a programming job that would be better with OO programming 
than with declarative programming. 


Exercise 3.3 


Give an example of a programming job that would be better with declarative 
programming than with OO programming. 


Exercise 3.4 
What bioinformatics problem might be best addressed with logic programming? 
Exercise 3.5 
Download and use a Perl class from CPAN. 
Exercise 3.6 
Write a Perl class that manages laboratory supplies. 
Exercise 3.7 


When would you want a separate initialization method for a class; when would you 
want the initialization to be part of the new constructor? 


Exercise 3.8 


Modify Gene.pm to keep count of how many objects refer to given organisms, 
chromosomes, authors, pdb references, and names. 


Exercise 3.9 





Add a DESTROY method to a class so an object can self-destruct. 
Exercise 3.10 
Beginning in the code for Gene3.pm you'll find the following regular expression: 


if (SAUTOLOAD =~ /.*(_\wt)/) { 
Sattribute = $1; 


This only catches the last part of a name that has an underscore. What if you want to 
allow names such as get_other var? Write a regular expression that would extract 
such names as other var from get_other var. 


Exercise 3.11 
In the code for Gene2.pm you'll find the following regular mutator method: 


sub set name { 
my ($self, Sname) = @ ; 
S$self->{ name} = Sname if $name; 


} 
This breaks if $name has certain values such as "", 0, or OEO. How can you catch these 


cases? 
Team LiB 
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Chapter 4. Sequence Formats and 
Inheritance 


This chapter applies concepts and techniques from previous chapters to a concrete project: 
handling sequence files. The chapter also introduces a few new techniques including a very 
important one called class inheritance. The code developed in this chapter will also be 
incorporated into later chapters. 


Class inheritance allows you to define a new class by inheriting from other classesaltering or 
making additions as needed. It's a style of software reuse that is particular to object-oriented 
design. 


The first class developed in this chapter is a simple one: reading and writing files. Using 
inheritance, you can extend that class to a new one that can recognize, read, and write data 
in several different biological sequence datafile formats. 


The goal is, as always, to learn enough about Perl to develop software for your own needs. 
The code in this chapter is designed with this goal in mind. In particular, the exercises at the 
end of the chapter will ask you to extend and improve the code in various ways. 


Team LiB 
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4.1 Inheritance 


You've seen the use of modules and how a Perl program can use all the code in a module by 
simply loading it with the use command. This is a simple and powerful method of software 
reuse, by which software can be written once but used many times by different programs. 


You've also seen how object-oriented Perl defines classes in terms of modules, and how the 
use of classes, methods, and objects provides a more structured way to reuse Perl software. 


There's another way to reuse Perl classes. It's possible for a class to inherit all the code and 
definitions of another base class. (This base class is sometimes called a superclass or a 
parent class.) The new derived class (a.k.a. subclass) can add more definitions or redefine 
certain definitions. Perl then automatically uses the definitions of everything in the old class 
(but treats them as if they were defined in this new derived class), unless it finds them first in 
the derived class. 


In this chapter, I'll first develop a class FileIO.pm, and then use the technique of inheritance 
to develop another class SeqFileIO.pm that inherits from FileIO.pm. This way of reusing 
software by inheritance is extremely convenient when writing object-oriented software. For 
instance, I make SeqFileIO do a lot of its work simply by inheriting the base class FileIo 
and then adding methods that handle sequence file formats. I could use the same base class 
FileIo to write a new class that specializes in handling HTML files, microarray datafiles, SNP 
database files, and so on. (See the exercises at the end of the chapter.) 


When inheriting a class, it is sometimes necessary to do a bit more than just add new 
methods. In the SeqFileIo class, I add some attributes to the object, and as a result the 
hash 3 attribute properties also has to be changed. So in the new class I define a new 
hash with that name, and as a result the old definition from the base class is forgotten and 
the new, redefined hash is used. As you read the new class, compare it with the base class 
FileIo, Make note of what is new in the class (e.g., the various put methods), what is being 
redefined from the base class (e.g., the hash just mentioned), and what is being inherited 
from the base class (e.g., the new constructor.) This can help prepare you to write your own 
class that uses inheritance. 


Occasionally, you want to invoke a method from a base class that has been overridden. You 
can use the special SUPER class for that purpose. I don't use that in the code for this chapter, 
but you should be aware that it is possible to do. 


Team LiB 
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4.2 FilelO.pm: A Class to Read and Write Files 


Even though you can easily obtain excellent modules for reading and writing files, this chapter 
shows you how to build a simple one from scratch. One reason for doing this is to better 
understand the issues every bioinformatics programmer needs to face, such as how to 
organize files and keep track of their contents. Another reason is so you can see how to 
extend the class to deal with the multiple file format problem that is peculiar to bioinformatics. 


It's not uncommon for a biologist to use several different types of formats of files containing 
DNA or protein sequence data and translate from one format to another. Doing these 
translations by hand is very tedious. It's also tedious to save alternate forms of the same 
sequence data in differently formatted files. You'll see how to alleviate some of this pain by 
automating some of these tasks in a new class called SeqFileIO.pm. 


Class inheritance is one of the main reasons why object-oriented software is so reusable. In 
order to see clearly how it works, let's start with the simple class FileIO.pm and later use it 
to define a more complex class, SeqFileIO.pm. 


FileIo is a simple class that reads and writes files, and stores simple information such as the 
file contents, date, and write permissions. 


You know that it's often possible to modify existing code to create your own program. When 
I wrote FileIO.pm, I simply made a copy of the Gene. pm module from Chapter 3 and 
modified it. 


On my Linux system, I started by copying FileIO.pm from Gene.pm and giving it a new 
name: 


cp Gene.pm FilelIO.pm 

I then edited the new file FileIO.pm changing the line near the top that says: 
package Gene; 

to: 

package FilelI0O; 


The filename must be the same as the class name, with an additional .pm. 


Though I now needed to modify the module to do what I want, a surprising amount of the 
overall framework of the codeits constructor, accessor and mutator methods, and its basic 
data structuresremains the same. Gene.pm already contained such useful parts as a new 
constructor, a hash-based object data structure, accessor methods to retrieve values of the 
attributes of the object, and mutator methods to alter attribute values. These are likely to be 
needed by most classes that you'll write in your own software projects. 


4.2.1 Analysis of FilelO 


Following is the code for FileIO, with commentary interspersed: 
package FilelI0O; 


m 


# A simple IO class for sequence data files 
n 


use Strict; 

use warnings; 

our SAUTOLOAD; # before Perl 5.6.0 say "use vars 'SAUTOLOAD';" 
use Carp; 
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# Class data and methods 
{ 

# A list of all attributes with defaults and read/write/required/noinit 
properties 


fo) 


my % attribute properties = ( 











_ filename => [ '', "read.write.required'], 
_filedata => fi Lite 'read.write.noinit'], 

_date => fi Es 'read.write.noinit'], 
_writemode => if, VSty 'read.write.noinit'], 


i 


# Global variable to keep count of existing objects 
my $ count = 0; 


# Return a list of all attributes 
sub _all_ attributes { 


keys % attribute properties; 


} 


# Check if a given property is set for a given attribute 
sub permissions { 
my(Sself, Sattribute, S$permissions) = @ ; 


$_ attribute properties {Sattribute}[1] =~ /Spermissions/; 





} 


# Return the default value for a given attribute 
sub attribute default { 
my(S$self, Sattribute) = @ ; 
$_ attribute properties {Sattribute} [0]; 
} 


# Manage the count of existing objects 
sub get count { 
$ count; 


n 


ub incr count { 
++$ count; 





n 


ub _decr_ count { 
==§$ county 








} 
In this first part of FileIO.pm file, the headers are exactly the same as in Gene. pm. 


The opening block, which contains the class data and methods, also remains the same except 
for the hash %_ attribute properties. This new version of the hash has different attributes 
(the filename, the file data, the last modification date of the file, and the mode to use in 
writing a file) tailored to the needs of reading and writing files. 


In addition to the read, write, and required properties, there is also a new "no initialization" 
(or noinit) property. An attribute with the noinit property may not be given an initial value 
when an object is created with a call to the new constructor. In this module, attributes such as 
data from the file, or the date on the file, are set only when the file is read or written. You 
may also have noticed that the default value for the filedata attribute is an anonymous 
array. 


Note that each attribute has both read and write properties. This being the case, you can 
simply omit the listing of the properties. However, in the interest of future modification, when 
I may want to add some attribute that won't have both properties, I've left in the 
specification of the two read and write properties. (Note that one method name, get_count 
, doesn't start with an underscore; this encourages you to call this method to get a count of 
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how many objects currently exist.) 
4.2.1.1 The constructor method 


You'll notice in the following code that I have cut the new constructor down to the bare bones. 


# The constructor method 





# Called from class, e.g. S$obj = FilelIO->new( ); 
sub new { 
my (S$class, Sarg) = @ ; 


# Create a new object 
my Sself = bless { }, Sclass; 





Sclass->_incr_count( ); 
return $self; 


} 
Why did I do so? Read on. 


The read method 


The code continues with the read method: 


# Called from object, e.g. Sobj->read( ); 
sub read { 
my ($self, %arg) = @; 


# Set attributes 








foreach my Sattribute ($self->_all_attributes( )) { 
# E.g. attribute = "filename", argument = "filename" 
my(Sargument) = (Sattribute =~ /* (.*)/); 


# If explicitly given 
if (exists Sarg{Sargument}) { 
# If initialization is not allowed 
if (Sself->_ permissions (Sattribute, 'noinit')) { 
croak("Cannot set Sargument from read: use set_Sargument") ; 








} 
Sself->{Sattribute} Sarg{Sargument}; 
# If not given, but required 
}elsif (S$self->_permissions (S$attribute, 'required')) { 
croak("No Sargument attribute as required"); 
# Set to the default 
}else{ 
S$self->{Sattribute} = $self->_attribute default (Sattribute) ; 








} 
} 


# Read file data 


unless( open( FileIOFH, $self->{ filename} ) ) { 
croak("Cannot open file ". S$self->{ filename} ); 

} 

oseli=>{" faledata’} = [ <FileLOFH> |; 


Sself->{' date'} = localtime((stat FileIOFH) [9]); 
close (FileIOFH) ; 


} 


This new read method has two parts. The first includes the initialization of the object's 
attributes from the arguments and the defaults as specified in the s_ attribute properties 
hash. The second includes the reading of the file and the setting of the filedata and date 
attributes from the file's contents and its last modification time. 


The first loop in the program initializes the attributes. If an attribute is specified as an 
argument, the first test is to see if the noinit property is set. This forbids initializing the 
attribute, in which case the program croaks. Otherwise, the attribute is set. 


If the attribute isn't passed as an argument but has a required property (only the filename 
attribute has the required property), the program croaks. 


Finally, if the argument isn't given and not required, the attribute is set to the default value. 


After performing those initializations, the read method reads in the specified file. If it can't 
open the file, the program croaks. (See the exercises for a discussion of this use of croak.) 


The file is read by the line: 
reelt=>{" tiledata’} = [ <FiléelOrH> |]; 


In list context, the input operator on the opened filehandle, which is given by <FileIOFH> 
reads in the entire file. This is done within an anonymous array, as determined by the square 
brackets around the input operator angle brackets. A reference to this anonymous array 
containing the file's contents is then assigned to the filedata attribute. 


4.2.1.2 stat and localtime functions 


Finally, the Perl stat and localtime functions are called to generate a string with the file's 
last modification time, which is assigned to the object attribute date. 


This method of reading a file makes many choices. For instance, the stat command returns 
an array with many more items of interest about a file, such as its size, owner, access 
permission modes, and so on (the tenth item of which is the modification time). As you 
develop your programs, you should be paying attention to details such as whether you need 
to save some of these additional attributes of a file, the last modification date, or notes about 
the kind of data in the file. 


The next line of code in the program: 
= 


# N.B. no "clone" method is necessary 


# 


is yet another choice to think about. Are there occasions when cloning a file object makes 
sense? Maybe I'd like to clone a file object, make some small change to the data, give it a 
new filename, and write it out. Why have I left this out? 


4.2.1.3 The write method 


The code continues: 
# Write files 











# Called from object, e.g. Sobj->write( ); 
sub write { 
my ($self, %arg) = @ ; 
foreach my Sattribute ($self->_all_attributes( )) { 
# E.g. attribute = "_ filename", argument = "filename" 
my(Sargument) = (S$attribute =~ /* (.*)/); 


# If explicitly given 
if (exists Sarg{Sargument}) { 
Sself->{Sattribute} = Sarg{Sargument}; 
} 
} 


unless( open( FileIOFH, $self->get_writemode . $self->get_filename ) ) { 
croak("Cannot write to file " . Sself->get_ filename) ; 








} 
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unless( print FileIOFH $self->get_filedata ) { 
croak("Cannot write to file " . S$self->get_filename) ; 





} 
Sself->set_date(scalar localtime((stat FileIOFH) [9])); 
close (FileIOFH) ; 





return 1; 


} 


The write method handles writing a file object out to an actual file. First, all arguments 
corresponding to attributes are set as requested. The file is then opened for writing, using the 
_writemode attribute to specify; for example, > for truncating the file before writing or >> for 
appending to the file. The print FileIOFH statement actually does the writing to the opened 
FileIOFH filehandle, retrieving the file data from the object with the get_filedata method 
defined by means of AUTOLOAD. Finally, the object's date attribute is reset to the new 
modification time. 


4.2.1.4 AUTOLOAD 


The next section of code is the AUTOLOAD method itself: 


# This takes the place of such accessor definitions as: 


# sub get_attribute { ... } 
# and of such mutator definitions as: 
# sub set_attribute { ... } 
sub AUTOLOAD { 
my ($self, Snewvalue) = @ ; 
my (Soperation, Sattribute) = (SAUTOLOAD =~ /(get|set) (_\wt)$/); 


# Is this a legal method name? 
unless(Soperation && Sattribute) { 

croak "Method name 'SAUTOLOAD' is not in the recognized form\n"; 
} 
unless(exists Sself->{Sattribute}) { 

croak "No such attribute 'Sattribute' exists in the class ", 

ref ($self); 

} 





# AUTOLOAD accessors 
if(Soperation eq 'get') { 
unless (S$self-> permissions (Sattribute, 'read')) { 
croak "Sattribute does not have read permission"; 


} 


# Turn off strict references to enable symbol table manipulation 
no Striet. “cef's; 
# Install this accessor definition in the symbol table 
*{SAUTOLOAD} = sub { 
my ($self) = @ ; 
unless (S$self-> permissions (Sattribute, 'read')) { 
croak "Sattribute does not have read permission"; 








} 
if (ref (Sself->{Sattribute}) eq 'ARRAY') { 
return @{$self->{Sattribute}}; 
jelse{ 
return $self->{S$attribute}; 
} 
}? 
# Turn strict references back on 
no strict "refs"; 


# Return the attribute value 


# The attribute could be a scalar or a reference to an array 
if (ref (Sself->{Sattribute}) eq 'ARRAY') { 
return @{$self->{Sattribute}}; 
jelse{ 
return S$self->{Sattribute}; 
} 


# AUTOLOAD mutators 
}elsif (Soperation eq 'set') { 
unless (S$self-> permissions (Sattribute, 'write')) { 
croak "Sattribute does not have write permission"; 


} 


# Turn off strict references to enable symbol table manipulation 
no Strick “nets; 
# Install this mutator definition in the symbol table 
*{SAUTOLOAD} = sub { 
my ($self, Snewvalue) = @ ; 
unless ($self-> permissions (Sattribute, 'write')) { 
croak "Sattribute does not have write permission"; 











} 
Sself->{Sattribute} = $newvalue; 





}e 
# Turn strict references back on 
no Street “rets" + 





# Set and return the attribute value 
Sself->{Sattribute} = S$newvalue; 
return S$self->{Sattribute}; 





} 


This AUTOLOAD method has grown! There's only one difference, however, between this code 
and the AUTOLOAD code for the Gene.pm class. The new set of attributes for FileIO.pm don't 
all take simple scalar values, as was the case with Gene.pm. Another attribute, filedata, is 
a reference to an anonymous array. In order for the accessors to return the correct data, 
they must check to see if an attribute is a scalar or a reference to an array; the accessors 
can then dereference and return the data from the method call. 


So the accessors, and the definitions of them installed into the symbol table, test for an array 
reference and dereference it accordingly. Other than that, this AUTOLOAD method is exactly 
the same as that defined for Gene.pm. 


You may also have noticed that sections of code in the AUTOLOAD method are almost identical 
to each other. Recall that AUTOLOAD is invoked when a method with no subroutine defining it is 
called. AUTOLOAD must do two things. First, it performs whatever method is requested; for 
example, if an accessor method is requested, it returns the appropriate value. Second, it 
defines the subroutine that implements the requested method and installs it in the symbol 
table so the next time the method is called, AUTOLOAD and its considerable overhead won't be 
necessary. Because of these parameters, the code AUTOLOAD executes to handle the 
requested method is nearly identical with the method that AUTOLOAD also defines. 


Finally, here are the last sections of the FileIO.pm program: 


# When an object is no longer being used, this will be automatically called 
# and will adjust the count of existing objects 
sub DESTROY { 

my(S$self) = @ ; 

$self->_decr_count( ); 





} 


# Other methods. They do not fall into the same form as the majority handled 
by AUTOLOAD 
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tt 


iF 


The only change here is that there are no other methods (Gene.pm had a citation method). 
4.2.2 Finishing FilelO 


To finish FileIO.pm, here's some very terse (too terse for anything but a textbook) POD 
documentation: 


=headl FileIo 
FileIO: read and write file data 
=headl Synopsis 
use FilelI0O; 
my Sobj = RawfilelO->read ( 
filename => 'jkl1' 
\e 


Sobj->ge 
Sobj->ge 


prin 
prin 


_filename, "\n"; 
_filedata; 


Cr it 


Sobj->set_date('today'); 
print Sobj->get_date, "\n"; 











print Sobj->get _writemode, "\n"; 


my @newdata = ("linel\n", "line2\n"); 
Sobj->set_filedata( \@newdata ); 





Sobj->write (filename => 'lkj'); 
Sobj->write (filename => 'lkj', writemode => '>>'); 





my So = RawfileIO->read(filename => '1kj"'); 
print $o->get_filename, "\n"; 
print $o->get_filedata; 


my Sgenel = Gene->new ( 
name => 'biggene', 
organism => 'Mus musculus', 
chromosome => '2p', 
pdbref => 'pdb5775.ent', 
author => 'L.G.Jeho', 
date => ‘August 23, 1989', 





ez 








print "Gene name is ", $genel->get_name( ); 

print "Gene organism is ", Sgenel->_get_organism( ); 
print "Gene chromosome is ", $genel->_get_chromosome( ); 
print "Gene pdbref is ", $genel->_get_pdbref( ); 

print "Gene author is ", $genel->_get_author( ); 

print "Gene date is ", Sgenel=-> get date( ); 

Sclone = Sgenel->clone (name => 'biggeneclone'); 


$genel-> set chromosome ('2q'); 
S$genel-> set_pdbref('pdb7557.ent"'); 
$genel-> set_author('G.Mendel'); 
Sgenel-> set_date('May 25, 1865'); 
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Sclone->citation('T. Morgan', ‘October 3, 1912'); 


print "Clone citation is ", $clone->citation; 





=headl AUTHOR 

James Tisdall 

=headl COPYRIGHT 

Copyright (c) 2003, James Tisdall 


=cut 


4.2.3 Testing the FilelO Class Module 


Now that we've got a class module, complete with examples of its use, let's write a small test 
program and see how it works. Since the examples in the documentation are, in effect, a 
small test program, let's try running it. We'll use the file file1.txt I created with my text 
editor that contains: 


> sample dna (This is a typical fasta header.) 
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg 
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct 
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc 
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgcetggcgggggt 
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc 
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 
acacctgagccactctcagatgaggaccta 


I'll take the code from the documentation pretty much as is, just adding strict and 
warnings. I'll also include a use lib directive that adds my development library directory to 
the list of directories in @INC, which tells my computer's Perl where to look for modules. 
(Recall that you can either edit this line, override it with the PERL5LIB environmental variable, 
or give your own directory on the command line.) I also add a few print statements to make 
the output easier to read: 


!/usr/bin/perl 


use strict; 
use warnings; 
use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 





use Filel0O; 





my Sobj = FileIO->new( ); 





Sobj->read ( 
filename => 'filel.txt' 
); 


print "The file name is ", Sobj->get_filename, "\n"; 
print "The contents of the file are:\n", Sobj->get_filedata; 
print "(nthe date of the file as ", Sobj-Sgér. date, \n"; 


Sobj->set_date('today'); 





print "The reset date of the file is ", Sob j->get date, "\n"; 
print "The write mode of the file is ", Sobj->get_writemode, "\n"; 
print "\nResetting the data and filename\n"; 

my @newdata = ("linel\n", "line2\n") ; 


Sobj->set_filedata( \@newdata ); 
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print "Writing a new file \"file2\"\n"; 
Sobj->write (filename => 'file2'); 





print "Appending to the new file \"file2\"\n"; 
Sobj->write (filename => 'file2', writemode => '>>'); 








print "Reading and printing the data from \"file2\":\n"; 


my $file2 = FilelO->new( ); 





Sfile2->read ( 
filename => 'file2' 
); 





print "The file name is ", $file2->get_filename, "\n"; 
print "The contents of the file are:\n", $file2->get_filedata; 





I finally run the test program to get the following output: 


The file name is filel.txt 

The contents of the file are: 

> sample dna (This is a typical fasta header.) 
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg 
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct 
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcec 
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgcetggcgggggt 
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc 
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 
acacctgagccactctcagatgaggaccta 





The date of the file is Thu Dec 5 11:22:56 2002 
The reset date of the file is today 
The write mode of the file is > 








Resetting the data and filename 

Writing a new file "file2" 

Appending to the new file "file2" 

Reading and printing the data from "file2": 
The file name is file2 

The contents of the file are: 

linel 

line2 

linel 

line2 








The module seems to be performing as hoped. So, now we have a simple module that reads 
and writes files and provides a few options for the write mode. 


But, frankly, this isn't too impressive. You've already been reading and writing files in Perl 
without the overhead of this FileIO module. The interface to the code is nice, and it's good 
to have objects that contain the file data, but what has really been accomplished? 


The real power of this approach is coming up next. Using class inheritance, this simple module 
can be extended relatively easily in a very useful direction. 


It's another case of the basic software engineering approach of making small, simple, 
generally useful tools, and then combining them into more powerful and specific applications. 
So, next, I'll take my simple FileIo class and use it as a base class for a 
bioinformatics-specific class. 


4 PREVIOUS 
NEXT > 
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4.3 SeqFilelO.pm: Sequence File Formats 


Our primary interest is bioinformatics.Can we extend the FileIO class to handle biological 
sequence datafiles? For example, can a class be written that takes a GenBank file and writes 
the sequence out in FASTA format? 


Using the technique of inheritance, in this section I present a module for a new class 
SeqFileIo that performs several basic functions on sequence files of various formats. When 
you call this module's read method, in addition to reading the file's contents and setting the 
name, date, and write mode of the file, it automatically determines the format of the 
sequence file, extracts the sequence, and when available, extracts the annotation, ID, and 
accession number. In addition, a set of put methods makes it easy to present the sequence 


and annotation in other formats.” 
"] Don Gilbert's readseq package (see http://iobio.bio.indiana.edu/soft/molbio/readseg and 
ftp://ftp.bio.indiana.edu/molbio/readseq/classic/src) is the classic program (written in C) for reading and 
writing multiple sequence file formats. 








4.3.1 Analysis of SeqFilelO.pm 


The first part of the module SeqFileIO.pm contains the block with definitions of the new, or 
revised, class data and methods. 


The first thing you should notice is the use command: 


use base ( "FileIO" ); 


This Perl command tells the current package SeqFiletIo it's inheriting from the base class 
FileIo. Here's another statement that's often used for this purpose: 


@ISA = ( "FileIO" ); 


The @ISA predefined variable tells a package that it "is a" version of some base class; it then 
can inherit methods from that base class. The use base directive sets the @ISA array to the 
base class(es), plus a little else besides. (Check perldoc base for the whole story.) Without 
getting bogged down in details use base works a little more robustly than just setting the 
@ISA array, so that's what I'll use here: 


package SeqFilelIoO; 
use base ( "FileIO" ); 


use strict; 

use warnings; 

Huse vars 'SAUTOLOAD'; 
use Carp; 





Class data and methods 


{ 
# A list of all attributes with defaults and read/write/required/noinit 








properties 
my % attribute properties = ( 

_ filename => [ '', "read.write.required'], 
_filedata => f [ d,s 'read.write.noinit'], 
_date => fF My 'read.write.noinit'], 
_writemode => [ "Ss"; 'read.write.noinit'], 
_ format =>. Vy 'read.write'], 

_ sequence => if PEs 'read.write'], 
_header => [PEs 'read.write'], 

_id => fe, 'read.write'], 
_accession => we, 'read.write'], 
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)? 


# Return a list of all attributes 
sub all attributes: { 
keys % attribute properties; 


} 


# Check if a given property is set for a given attribute 
sub permissions { 
my(Sself, S$attribute, S$permissions) = @ ; 


$_attribute properties {Sattribute}[1] =~ /Spermissions/; 





} 


# Return the default value for a given attribute 
sub attribute default { 
my(Sself, Sattribute) = @ ; 
$_attribute_ properties {Sattribute} [0]; 
} 





my @ seqfileformats = qw( 
_raw 
—embl 
_fasta 
_9C9 
_genbank 
pir 
_staden 
); 
sub isformat { 
my ($self) = @; 


for my $format (@ seqfileformats) { 
my Sis format = "is$format"; 


if(S$self->Sis format) { 
return Sformat; 
} 
} 


return 'unknown'; 


} 

4.3.1.1 The power of inheritance 

A comparison with the opening block in the base class FileIo is instructive. You'll see that I'm 
redefining the 3 attribute properties hash and the methods in the block that access the 


hash as closures. This is necessary to extend the new class to handle new attributes that 
relate specifically to sequence datafiles: 
_ format 

The format of the sequence datafile, such as FASTA or GenBank 
_ sequence 

The sequence data extracted from the sequence datafile as a scalar string 
_header 

The header part of the annotation; defined somewhat loosely in this module 
aid 

The ID of the sequence datafile, such as a gene name or other identifier 


_accession 
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The accession number of the sequence datafile, when provided 


You'll notice, in comparing this opening block with the opening block of the base class FileIo, 
that the methods and variables relating to counting the number of objects has been omitted 
in this new class. 


Here's where you see the power of inheritance for the first time. When the code in the new 
SeqFileIO.pm module tries to call, say, the get_count method, and Perl sees that it's not 
defined in the module, it will go on a hunt for the method in the base class and use the 
definition it finds there. On the other hand, if it finds the method in SeqFileIO.pm, it just uses 
that; if the get_count method appeared in the base class but is redefined in SeqgFileIO.pm, it 
uses that redefinition and ignores the older definition in the base class. 


So, you don't have to rewrite methods in the base class(es); you can just call them. Perl not 
only finds them, it also calls them as if they were defined in the new class. For example, 

ref (Sself) returns SeqFileIO, not File1Io, without regard to whether the method is defined 
in the new class or inherited from the old class. 


Finally, there are some new definitions in this new version of the opening block. An array 

@ seqfileformats lists the sequence file formats the module Knows about, and a method 
isformat calls the methods associated with each format (such as is_ fasta defined later in 
the module) until it either finds what format the file is in or returns unknown. 


The @ seqfileformats array uses the qw operator. This splits the words on whitespace 
(including newlines) and returns a list of quoted words. It's a convenience for giving lists of 
words without having to quote each one or separate them by commas. (Check the perlop 
manpage for the section on "RegExp Quote-Like Operators" for all the variations and 
alternative quoting operators.) 


Following the opening block, notice that there is no new method defined in this class. The 
simple, bare-bones new method from the FileIO class serves just as well for this new class; 
thanks to the inheritance mechanism, there's no need to write a new one. A program that 
calls SeqFileIO->new will use the same method from the FileTo class, but the class name 
will be properly set to SeqFileIo for the new object created. 


4.3.1.2 A new read method 


The next part of the program has a rewrite of the read method. As a result, the read method 
from the parent FileIo class is being overridden: 


# Called from object, e.g. Sobj->read( ); 
sub read { 
my ($self, %arg) = @; 


# Set attributes 











foreach my Sattribute ($self->_all_attributes( )) { 
# E.g. attribute = " filename", argument = "filename" 
my(Sargument) = (Sattribute =~ /* (.*)/); 


# If explicitly given 
if (exists Sarg{Sargument}) { 
# If initialization is not allowed 
if (S$self->_ permissions (S$attribute, 'noinit')) { 
croak("Cannot set Sargument from read: use set_Sargument") ; 








} 
Sself->{Sattribute} = Sarg{Sargument}; 
# If not given, but required 
}elsif ($self->_permissions (S$attribute, 'required')) { 
croak("No Sargument attribute as required"); 
# Set to the default 
jelse{ 
Sself->{Sattribute} = $self->_attribute default (Sattribute) ; 
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} 


# Read file data 

unless( open( FileIOFH, $self->{ filename} ) ) { 
croak("Cannot open file ". S$self->{ filename} ); 

} 

Sself->{' filedata’} = [ <FilelOFH> ]; 

Sself->{' date'} = Localtime ((stat File1OFH) [9)); 

S$self->{' format'} = $self->isformat; 

my Sparsemethod = 'parse' . S$self->{' format'}; 

Sself->Sparsemethod; 


close (FileIOFH) ; 
} 


The new read method just shown is almost exactly the same as the read function in the 
FileIo class. There are only three new lines of code, just before the end of the subroutine, 
preceding the call to the close function: 


Sself->{' format'} = $self->isformat; 
my S$parsemethod = 'parse' . $self->{' format'}; 
Sself->Sparsemethod; 


These three new lines initialize the new attributes that were defined for this class. First, a call 


to isformat determines the format of the file and sets the _format attribute appropriately. 


The appropriate parse_ method name is then constructed, and finally, a call to that method is 


made. As you're about to see, the parse_ methods extract the sequence, header, ID, and 
accession number (or as many of these as possible) from the file data and set the object's 
attributes accordingly. 


So, by the end of the read method, the file data has been read into the object, the format 
determined, and the interesting parts of the data (such as the sequence data) parsed out. 


4.3.2 New Methods: is, parse, and put 


The rest of the module consists of three groups of methods that handle the different 
sequence file formats. These methods didn't appear in the more simple and generic FileIo 
module and must be defined here: 


is 

Tests to see if data is in a particular sequence file format 
parse 

Parses out the sequence, and when possible the header, ID, and accession number 
put 


Takes the sequence attribute, and when available the header, ID, and accession 
number attributes, and writes them in the sequence file format named in the format 
attribute 


4.3.2.1 is_ methods 


The first group of methods tests an object's file data to see if it conforms to a particular 
sequence file format: 


sub is_raw { 
my($self) = @ ; 


my S$seq = join('', @{Sself->{ filedata}} ); 
(Sseq =~ /*[ACGNT\s]+$/) ? return 'raw' : 0; 
} 


sub is_embl { 
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my($self) = @ ; 
my (Sbegin, $seq,$end) = (0,0,0); 


foreach( @{S$self->{_filedata}} ) { 

/*ID\s/ && Sbegint++; 

/*SQ\s/ && Sseqtt; 

/*\/\// && Sendtt; 

((Sbegin = = 1) && (Sseq = = 1) && (Send = = 1)) && return 'embl1'; 
} 
return; 


} 


sub is fasta { 
my($self) = @ ; 


my($flag) = 0; 
for (@{S$self->{ filedata}}) { 


#This to avoid confusion with Primer, which can have input beginning ">" 
/*\*seq.*:/i && (Sflag = 0) && last; 


if( /*>/ && $flag = = 1) { 
easy 
}elsif( /*>/ && Sflag = = 0) { 
Sflag = 1; 
}elsif( (! /*>/) && Sflag = = 0) { #first line must start with ">" 
esis 
} 
} 
Sflag ? return 'fasta' : return; 


} 


sub is_gcg { 
my ($self) = @ ; 


my($i,$j) = (0,0); 


for (@{S$self->{ filedata}}) { 

/*\s*S8/ && next; 

/Length:.*Check:/ && (Si += 1); 

/*\s*\dt+\s* [a-zA-Z\s]t+/ && ($j += 1); 

($i = = 1) && ($3 = = 1) && return('gcg'); 
} 


return; 


sub is_genbank { 
my($self) = @ ; 


my SFeatures = 0; 


for (@{S$self->{ filedata}}) { 
/*LOCUS/ && (S$Features += 1); 
/*“DEFINITION/ && (SFeatures += 2); 
/*ACCESSION/ && (SFeatures += 4); 
/“ORIGIN/ && (SFeatures += 8); 
/*\/\// && ($Features += 16); 
(SFeatures = = 31) && return 'genbank'; 
} 


return; 
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sub is pir { 


my($self) = @ ; 
my (Sent, $ti, $date, Sorg,$ref, $sum,$seq,$end) = (0,0,0,0,0,0,0,0); 
for (@{S$self->{_filedata}}) { 


r 

/ENTRY/ && Sent++; 

/TITLE/ && Stitt; 

/DATE/ && Sdatet+t; 

/ORGANISM/ && Sorgt+; 

/REFERENCE/ && Sref++; 

/SUMMARY/ && Ssumt++; 

/SEQUENCE/ && Sseqt+t; 

/\/\/\// && Send++; 

Sent = = 1 && Sti = = 1 && Sdate >= 1 && Sorg >= 1 
&& Sref >= 1 && Ssum = = 1 && Sseq = = 1 && Send = = 1 
&& return 'pir'; 






































} 


return; 


} 


sub is_staden { 

my($self) = @ ; 

for (@{Sself->{ filedata}}) { 

/<-+([*-]*)-+>/ && return 'staden'; 

} 
} 
is_ methods are designed to be fast. They don't check to see if the file conforms to the 
official file format in every detail. Instead, they look to see if, given that the file is supposed to 
be a sequence file, there is a good indication that it is a particular file format. In other words, 
these methods perform heuristic, not rigorous, tests. If they are well conceived, these 
methods correctly identify the different formats without confusion and with a minimum of 
code and computation. However, they don't ensure that a format is conforming in every 
respect to the official format definition. (See the exercises for other approaches to file format 
recognition. ) 


Also, note that these are fairly simple tests; for example, the is_raw method doesn't check 
for the IUB ambiguity codes but only the four bases A, C, G, T, plus N for an undetermined 
base. Furthermore, some software systems have more than one format, for which only one 
format is shown here, such as PIR and GCG. The bottom line is that this code does what it is 
intended to do, but not more. It is, however, fairly short and easy to modify and extend, and 
I encourage you to try your hand at doing so in the exercises. 


An interesting Perl note: the "leaning toothpick" regular expression notation for three forward 
slashes /// is a bit forbidding because each forward slash needs a backslash escape in front of 
it: 


/\/\/\// && Sendt++; 
An alternative would be to use m and a separator other than /, which leads to more readable 
code: 

m!///! && Sendt++; 
One more interesting Perl note: in this code, I return a false value from a subroutine like so: 
return; 
This is a bit confusing because unless you already know or check in the documentation for 
return, it's not clear what's happening. In a scalar context, return; returns the undefined 


value undef; ina list context, return; returns an empty list. So, no matter what context the 
subroutine is called from, a valid value is returned. 
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4.3.2.2 put_ methods 


The next group of put_ methods returns an object's sequence and annotation data in a 


particular sequence file format: 


sub put_raw { 


my($self) = @ ; 
my (Sout) ; 
(Sout = $self->{_sequence}) =~ tr/a- 


return (Sout); 


} 


sub put_embl { 
my($self) = @ ; 


my (@out, $tmp, $len, $i,$j,$a,$c,$g,$t, 


$len = length ($self->{_sequence}) 


Sa=(Sself->{ sequence} =~ tr/Aa//); 
$c=($self->{ sequence} =~ tr/Cc//); 
$g=(Sself->{ sequence} =~ tr/Gg//); 
St=(Sself->{ sequence} =~ tr/Tt//); 


$o=(Slen - $a - $c - $g - St); 

$i=0; 

Sout [$i++] = sprintf ("ID 

Sout [Sit+] = "KXxX\n"s 

Sout [$i++] = sprintf ("SQ 
other; \n", 


sequence 


$s s\n", 


z/A-Z/; 


So); 


$self->{_header}, $self->{_id} ); 


sd BP; sd A; 


ole 
OQ, 
Q 
ole 
[er 
Q 
ole 
Q 
HW 
ol? 
oO; 


Slen, Sa, $c, Sg, $t, $0); 


for($j = 0; $3 < Slen; ) { 


Sout [$i] .= sprintf ("%s",substr($self->{ sequence},$j,10)); 


$j += 10; 
if( $j < S$len && $j % 60 !=0 ) 
Sout [$i] .= " "; 
jelsif ($3 % 60 = =0) { 
Sout [$i++] .= "\n"; 
} 
} 
if($j * 60 !=0) { 
Sout [$i++] .= "\n"; 
} 
Sout(Sil = "//\n"; 
return @out; 


} 


sub put_fasta { 
my($self) = @ ; 


my (@out,$len,$i,$4); 


S$len = length ($self->{_sequence}); 
Si = 0; 

pout Sit+] = "ao" 
for (Sj=07 Si<Slen + $7 += 50) 4 


S$self->{ header} 


{ 


WEN ry ts 


Sout [$it++]=sprintf ("%.50s\n",substr ($self->{_sequence},$j,50)); 


} 


return @out; 


} 


sub put_gcg { 
my($self) = @ ; 
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my (@out, $len, $i,$j,$cnt,$sum) ; 
S$len = length ($self->{_sequence}); 


#calculate Checksum 
for (Si=0; Six<Slen ;Si+t+) { 





Sent++; 
$sum += S$cnt * ord(substr($self->{_sequence},$i,1)); 
(Sent = = 57)&& (Sent=0); 


} 
Ssum %= 10000; 


Sa. = OQ; 

Sout [Sit] sprintf ("%ss\n", S$self->{_header}) ; 

Sout [Si++] = sprintf (" $s Length: %d (today) Check: %d ..\n", 
Sself->{_id}, 


Slen, S$sum); 
for($j = 0; $j < Slen; ) { 
if( $3 50 = = 0) { 
Sout[s2], —sprintt("ssd. S$ y+h)¢ 
} 
Sout [$i] .= sprintf ("%s",substr(S$self->{_sequence},$j,10)); 


ol? 





$j += 10; 
if( $j < Slen && $3 % 50 !=0) { 
Sout [$i] .= " "; 
jelsif ($3 % 50 = =0) { 
Sout [$i++] .= "\n"; 
} 
} 
LE(SS S50 t= 0.) 4 
Sout [$i] .= "\n"; 


return @out; 


} 


sub put_genbank { 
my(S$self) = @ ; 


my (@out, $len, $i,$j3,$cnt,$sum) ; 
my ($seq) = $self->{ sequence}; 


Sseq =~ tr/A-Z/a-z/; 
Slen = length ($seq); 
for ($i=0; $i<Slen ;Sit+) { 





Sent#+; 
Ssum += Scnt * ord(substr($seq,$i,1)); 
(Sent = = 57) && (Sent=0); 


} 

Ssum %= 10000; 

$i = 0; 

Sout [$i++] = sprintf ("LOCUS 

Sout [$i++] = sprintf ("DEFINITION % 
$self->{_header}, 


ole 





s $d bp\n",Sself=>{. 1d}; $len) 7 
s , %d. bases, Sd. sum. \n", 





Slen, S$sum); 
Sout [$i++] = sprintf ("ACCESSION $%s\n", Sself->{_accession}, ); 
Sout [Sit+] sprintf ("ORIGIN\n") ; 
for($j = 0 ; $j < Slen ; ) { 
if( $3 60 = =0) { 
Sout[$i] = sprintf("S38d ",$j+1); 
} 
Sout [$i] .= sprintf("%Ss",substr($seq,$j,10)); 








$j += 10; 
if( $j < Slen && $j 2 60 !=0) { 
Sout[$i] .= " "; 
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if($j % 60 !=0) { 
Sout [$i] .= "\n"; 
cee Or 


} 
Sout [$i] = "//\n"; 
return Gout; 


} 


sub put_pir { 
my($self) = @ ; 


my (S$seq) = S$self->{_ sequence}; 
my (@out, $len, $i,$j,Scnt,$sum) ; 
Slen = length ($seq); 

for (Si=—0> SisSlen 7Sit+) 4 


















































Sente+: 
Ssum += Scnt * ord(substr($seq,$i,1)); 
(Sent= =57) && ($cent=0); 
} 
Ssum $= 10000; 
$i = 0; 
Sout [$i++] = sprintf ("ENTRY s\n", $self->{_id}); 
Sout [$i++] = sprintf ("TITLE $s\n",S$self->{ _header}); 
#JDT ACCESSION out if defined 
Sout [$i++] = sprintf ("DATE Se \ yl hg 
Sout [$i++] = sprintf ("REFERENCE Se \n » YF 
Sout [Si++] = sprintf ("SUMMARY #Molecular-weight %d #Length %d #Checksum 
Sa\ 
n",0,Slen, $sum) ; 
Sout [$i++] = sprintf ("SEQUENCE\n") ; 
Sout [$i++] = sprintf(" 5 10 aS 20 25 











B0\n")F 
for ($j=1; $seq && $j < $len ; $4 += 30) { 
Sout(Sis+] = sprimer ("S7a. ",$4)4 
Sout [$i++] = sprintf("%$s\n", join(' ',split(//,substr($seq, $j - 
1, length ($seq) 
< 30 ? length($seq) : 30))) ); 
} 
Sout[$it+] = sprintf ("///\n"); 
return @out; 


sub put_staden { 
my($self) = @ ; 


my ($seq) = $self->{ sequence}; 
my ($i,$j,$len,@out) ; 


$i = 0; 
$len = length ($self->{_sequence}); 
Sout (Si) = "FP \<sssessSeeessesees= AP\n se 


substr (Sout [$i],int((20-length ($self->{_id}))/2),length(S$self->{_id})) = 
Sself-> 
{_id}; 
Say 
for (Sj=0; $j<Slen ; $j+=60) { 
Sout [$it++]=sprintf ("%s\n",substr ($self->{_sequence},$j,60)); 
} 
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return @out; 


} 


The put_ methods are, by necessity, more cognizant of the detailed rules of a particular file 
format. 


Note the Perl function ord in put_gcg. This built-in function gives the numeric value of a 
character. Other Perl functions such as sprintf and substr can be reviewed as necessary by 
typing perldoc -f substr (for example) or by visiting the http://www.perdoc.com web 


page. 





4.3.2.3 parse_ methods 


parse_, the third and final group of methods, parses the contents of a file in a particular 
format, extracting the sequence and, if possible, some additional header information. 


sub parse raw { 
my($self) = @ ; 


## Header and ID should be set in calling program after this 
my ($seq) ; 


S$seq = join('',@{$self->{_ filedata}}); 





if( (S$seq =~ /*([acgntACGNT\-\s]+)$/)) { 
(Sself->{_ sequence} = $seq) =~ s/\s//g; 
selse{ 


carp("parse raw failed"); 


} 


sub parse _embl { 
my($self) = @ ; 


my (Sbegin, $seq, $end, $count) = (0,0,0,0); 
my ($sequence, $head, $acc, $id); 


for (@{S$self->{ filedata}}) { 
++Scount; 
if(/*ID/) { 
Sbegint++; 
/ID\S* (2*\8) \8*/ && ($id = (Shead = $1)) =~ s/\su*//7 
felsif (/*SQ\s/) { 
Sseqtt; 
Jelsif(/*\/\//){ 
Send++; 
jelsif(Sseq = = 1){ 
Ssequence .= $ ; 
}elsif (/*AC\s*(.* (7 /\S)).*/) { #put this here - AC could be sequence 





Sacc .= $1; 
} 
if($begin = = 1 && Sseq = = 1 && Send = = 1) { 
Ssequence =~ tr/a-zA-Z//cd; 
Ssequence =~ tr/a-z/A-Z/; 
Sself->{_ sequence} = Ssequence; 





S$self->{_ header} = $head; 
$self->{_id} = $id; 
$self->{_accession} = $acc; 
return 1; 


return; 
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sub parse fasta { 
my ($self) = @ ; 


my(S$flag,$count) = (0,0); 
my ($seq, Shead, $id); 


for (@{$self->{ filedata}}) { 
#avoid confusion with Primer, which can have input beginning ">" 





/*\*seq.*:/i && (Sflag = = 0) && last; 

if (/*>/ && $flag = = 1) { 
Last; 

}elsif(/*>/ && Sflag = = 0O){ 
/*>\s*(.*\S)\s*/ && (Sid=(Shead=$1)) =~ s/\s.*//; 
Sflag=1; 

}elsif( (! /*>/) && Sflag = = 1) { 
$seq .= S$; 

}elsif( (! /*>/) && Sflag = = 0) f{ 

Las 

} 

++Scount; 


} 

if($flag) { 
Sseq =~ tr/a-zA-Z-//cd; 
Sseq =~ tr/a-z/A-Z/; 


$self->{_ sequence} = $seq; 
S$self->{ header} = $head; 
S$self->{_id} = $id; 





} 


sub parse gcg { 
my($self) = @ ; 


my ($seq, Shead, Sid); 
my(Scount,$flag) = (0,0); 


for (@{S$self->{ filedata}}) { 
LEC/?\S*SY) of 


}Jelsif(Sflag = = 0 && /Length:.*Check:/) { 
/*\s* (\St+) .* Length: .*Check:/; 
Sflag=1; 
(S$id=$1) =~ s/\s.*//; 
}elsif(Sflag = = 0 && /*\S/) { 
(Shead = $_) == s/\n//; 
pelsift(S$flag = = 1 @& /“\s* [*\d\sl]/) { 
ast? 
}elsif($flag = = 1 && /*\s*\dt+\s*[a-zA-Z \t]+$/) { 
$seq .= $_; 
} 
Scount++; 
} 
Sseq =~ tr/a-zA-Z//cd; 
Sseq =~ tr/a-z/A-Z/; 


Shead = Sid unless Shead; 





Sself->{_ sequence} = $seq; 
Sself->{ header} = Shead; 
$self->{_id} = $id; 


return 1; 
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} 

# 

# 

sub parse genbank { 
my($self) = @ ; 


my (Scount, $features, $flag,$seqflag) = (0,0,0,0); 


my ($seq, Shead, $id, $acc) ; 


for (@{$self->{ filedata}}) { 











if( /*LOCUS/ && S$flag = =0) { 
/*LOCUS\S* (.*\S) \s*8/; 
(S$id=(Shead=$1)) =~ s/\s.*//; 
Sfeatures += 1; 
Sflag = 1; 

belsif( /“DEFINITION\s*(.*)/ 6& Sfleqg = = 1) 
Shead .<= " Sis 
Sfeatures += 2; 

}elsif( /*ACCESSION/ && $flag = =1) { 
/*ACCESSION\s* (.*\S) \s*$/;7 
Shead ,=] " ™. (Sace=S1)' + 


Sfeatures += 4; 
}elsif( /*ORIGIN/ ) { 
Sseqflag = 1; 
Sfeatures += 8; 
Peleit( /“\AN\/7 ) 4 
Sfeatures += 16; 








}elsif( S$seqflag = = 0) { 

}elsif(Sseqflag = = 1) { 
Sseq .= 8 ; 

} 

++$ count; 

if($features = = 31) { 
Sseq =~ tr/a-zA-Z//cd; 
Sseq =~ tr/a-z/A-Z/; 





$self->{_ sequence} = $seq; 
S$self->{_ header} = $head; 
$self->{_id} = $id; 
Sself->{_ accession} = $acc; 


return 1; 
return; 
} 


sub parse pir { 
my($self) = @ ; 


my ($begin, $tit, $date, $Sorganism, $ref, Ssummary, $seq, Send, $count) 


(0,0,0,0,0,0,0,0,0); 
my (Sflag,$S$seqflag) = (0,0); 
my (Ssequence, Sheader,S$id,S$acc) ; 





for (@{S$self->{ filedata}}) { 
++$ count; 
if( /*ENTRY\s*(.*\S)\s*$/ && $flag = =0) { 
Sheader=$1; 
Sflag=1; 
Sbegin+t+; 








J}elsif( /*TITLE\s*(.*\S)\s*$/ && $flag = = 1 
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Sheader .= $1; 
Stith: 

}elsif( /ORGANISM/ ) { 
Sorganism+t; 




















}elsif( /*ACCESSIONS\s*(.*\S)\s*S$/ && Sflag = =1) { 
(Sid=(Sace=$1)) = e/\s*/ 77 

}elsif( /DATE/ ) { 
Sdatet+; 

}elsif( /REFERENCE/ ) { 
Sreftt+; 

}elsif( /SUMMARY/ ) { 








Ssummaryt++; 
}elsif( /*SEQUENCE/ ) { 
Sseqflag = 1; 

















Sseqtt; 
}elsif( /*\/\/\// && $flag = =1) { 
Sendt++; 
}elsif( S$seqflag = = 0) { 
next; 
}elsif( S$seqflag = = 1 && $flag = = 1) { 
Ssequence .= $ ; 
} 
if( Sbegin = = 1 && Stit = = 1 && Sdate >= 1 && Sorganism >= 1 
&& Sref >= 1 && Ssummary = = 1 && Sseq = = 1 && Send = = 1 
) { 
Ssequence =~ tr/a-zA-2Z//cd; 
Ssequence =~ tr/a-z/A-Z/; 





$self->{ sequence} = $seq; 
Sself->{ header} = Sheader; 
$self->{_id} = $id; 
$self->{_accession} = $acc; 





return 1; 


return; 


} 


sub parse staden { 
my($self) = @ ; 


my(Sflag,$count) = (0,0); 
my ($seq, Shead, $id); 
for (@{S$self->{ filedata}}) { 


if( /<---*\s* (.*[%-\s])\s*-*--->(.*)/ && Sflag = =0) { 
Sid = Shead = $1; 
Sseq .= $2; 
Sflag = 1; 

}elsif( /<---*(.*)-*--->/ && S$flag = =1) { 
Scount--; 
last; 

}elsif( $flag = =1) { 
$seq .= $_; 

} 

++Scount; 


} 

Li( Stlag ) { 
Sseq =~ s/-/N/g; 
Sseq =~ tr/a-zA-Z-//cd; 
Sseq =~ tr/a-z/A-Z/; 





$self->{_ sequence} = $seq; 
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Sself->{ header} = Shead; 
$self->{_id} = $id; 


return 1; 


} 


return; 


} 


Ly 


That's the end of the SeqFileIO.pm module that defines the SeqFilelIo class. 


You can see that to add the capability to handle a new sequence file format, you simply have 
to write new is_,put_, and parse_ methods, and add the name of the new format to the 
@ seqfiletypes array. So, extending this software to handle more sequence file formats is 


relatively easy. (See the exercises. ) 


To end this chapter, here is a small test program testSeqFileIo that exercises the main 
parts of the SeqFileIO.pm module, followed by its output. See the exercises for a discussion 


of ways to improve this test. 


4.3.3 Testing SeqFilelO.pm 


Here is the program testSeqFilelIo to test the SeqFileIo module: 
#!/usr/bin/perl 


use strict; 
use warnings; 


use SeqFilelI0O; 


First test basic FileIO operations 
(plus filetype attribute) 





my Sobj = SeqFileIO->new( ); 





Sobj->read ( 
filename => 'filel.txt' 
); 





print "The file name is ", Sobj->get_filename, "\n"; 

print "The contents of the file are:\n", Sobj->get_filedata; 

print "\nThe date of the file is "; Sobj->get date, "\n"; 

print "The filetype of the file is ", Sobj->get_ filetype, "\n"; 
Sobj->set_date('today'); 

print "The reset date of the file is ", Sob) ->get_date, "\n"; 
print "The write mode of the file is ", Sobj->get_writemode, "\n"; 
print "\nResetting the data and filename\n"; 





my @newdata = ("linel\n", "lLine2\n"); 
Sobj->set_filedata( \@newdata ); 


print "Writing a new file \"file2.txt\"\n"; 
Sobj->write (filename => 'file2.txt'); 





print "Appending to the new file \"file2.txt\"\n"; 
Sobj->write (filename => 'file2.txt', writemode => '>>'); 








use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 
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print 


"Reading and printing the data from \"file2.txt\":\n"; 





my Sfile2 = SeqFileIO->new( ); 


Sfile2->read ( 
filename => 'file2.txt' 


); 



































print "The file name is ", $file2->get_filename, "\n"; 

print "The contents of the file are:\n", $file2->get_filedata; 
print "The filetype of the file is ", Sfile2-sget_ filetype, "\Wi"y 
print <<'HEADER"; 

Hara ae ae ae aE EE EP 

' Test file format recognizing and reading 

ia ie ee 

HEADER 

my Sgenbank = SeqFilelIO->new( ); 





Sgenbank->read ( 
filename => 'record.gb' 


)3 


print "The file name is ", Sgenbank->get_ filename, "\n"; 

print "\nThe date of the file is ", $genbank->get_date, "\n"; 

print "The filetype of the file is ", $genbank->get_filetype, "\n"; 

print "The contents of the file are:\n", Sgenbank->get_filedata; 

Print "\ntHHHEE EEE EE EH HH HH HEH OF Fe EE EEE EEE EE EE HE HEH OF FH EE EE EE EE EE EE EE EE HO" | 
my Sraw = SeqFileIO->new( ); 





Sraw->read ( 
filename => 'record.raw' 


)3 








print "The file name is ", $raw->get_filename, "\n"; 
print "\nThe date of the file is ", S$raw->get_date, "\n"; 
print "The filetype of the file is ", Sraw->get_filetype, "\n"; 
print "The contents of the file are:\n", Sraw->get_filedata; 
print "\ntteH ee EE EEE EH HH HH HEH OF FE EE EEE EEE EE EE EE HSH OF FH HE EE EE EE EE EE EE RE SH 1" | 
my Sembl = SeqFilelIO->new( ); 
Semb1->read ( 
filename => 'record.embl' 
); 
print "The file name is ", Sembl->get filename, "\n"; 
print "\nThe date of the file is ", Sembl->get_ date, “\n"? 
print "The filetype of the file is ", Sembl->get_filetype, "\n"; 
print "The contents of the file are:\n", Sembl->get_filedata; 
print "\ntHeH ee EEE EEE HE HH HH HEH OF Fe EE EEE EEE EE EE HE HEH OF He EE EE EE EE EE EE EE EE 1"; 
my Sfasta = SeqFileIO->new( ); 





Sfasta->read ( 
filename => 'record.fasta' 


\; 

prant 
print 
print 





"The file name is ", $fasta->get_filename, "\n"; 
"\nthe date of the file is"; Sfasta->get_ date, "\n"; 
"The filetype of the file is ", Sfasta=->get.filetype, "\n"; 
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print "The contents of the file are:\n", $fasta->get_filedata; 


print "\nde td tae EE EH HEE EE Oo EE EEE aE EE RE OEE HE EE REE EEO" | 


my Sgcg = SeqFileIO->new( ); 
Sgcg->read ( 
filename => 'record.gcg' 








\; 

print "The file name is ", $gcg->get_filename, "\n"; 

print "\nThe date of the file is ", $gcg->get_date, "\n"; 

print "The filetype of the file is ", $gcg->get_filetype, "\n"; 
print "The contents of the file are:\n", Sgcg->get_filedata; 


print "\nde td daa aE EH HERE EA Oo EE EEE HERE EE RE OEE aR EE EE RR EERE OD" | 





my Sstaden = SeqFilelIO->new( ); 
Sstaden->read ( 

filename => 'record.staden' 
)3 








print "The file name is ", $staden->get_ filename, "\n"; 

print "\nihe date of the file 16 ">; Sstaden=>get idate;, "\al; 

print "The filetype of the file is ", $staden->get_filetype, "\n"; 
print "The contents of the file are:\n", Sstaden->get_filedata; 


print "\nde dd tera aE Ee HERE OE EEE a AT EE ERR OEE aR EEE EEE OO" | 


print <<'REFORMAT'; 








Hae aE aE ae aa ae a Ta aa aa 




















# 

# Test file format reformatting and writing 

# 

HEEGE EEE EEE HEE EEE EE EEE HH EE EE EEE EE EE EEE FHF 

REFORMAT 

print "At this point there are ", S$staden->get_count, " objects.\n\n"; 
print "#####F# \ntket## Testing put methods\n######\n\n"; 
print "\nPrinting staden data in raw format:\n"; 

print S$staden->put_raw; 

print "\nPrinting staden data in embl format:\n"; 

print Sstaden->put_embl; 

print "\nPrinting staden data in fasta format:\n"; 
print Sstaden->put_fasta; 

print "\nPrinting staden data in gcg format:\n"; 

print S$staden->put_gcg; 

print "\nPrinting staden data in genbank format:\n"; 
print S$staden->put_genbank; 

print "\nPrinting staden data in PIR format:\n"; 

print S$staden->put_pir; 














The test program depends on certain files being present on the system; the contents of the 
files are clear from the program output, and the files can be downloaded from this book's 
web site, along with the rest of the programs for this book. 
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4.3.4 Results 


Here is the output from testSeqFileIo: 


The file name is filel.txt 
The contents of the file are: 

> sample dna (This is a typical fasta header.) 
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg 
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct 
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc 
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgcetggcgggggt 
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc 
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 
acacctgagccactctcagatgaggaccta 








The date of the file is Thu Dec 5 11:22:56 2002 
The filetype of the file is fasta 

The reset date of the file is today 

The write mode of the file is > 








Resetting the data and filename 
Writing a new file "file2.txt" 
Appending to the new file "file2.txt" 
Reading and printing the data from "file2.txt": 
The file name is file2.txt 

The contents of the file are: 

linel 

line2 

linel 

line2 

The filetype of the file is _unknown 








are aE EEE HH EEE EE HEE EEE EH HE EERE EH HEH EEE 
# 
# Test file format recognizing and reading 
# 
HERE EE HEHEHE HE HEHE HEE EE EE EE EE EE EE EE EE EERE 


The file name is record.gb 





The date of the file is Sun Mar 30 14:30:09 2003 

The filetype of the file is _genbank 

The contents of the file are: 

LOCUS ABO031069 2487 bp mRNA PRI 27-MAY-2000 
DEFINITION Sequence severely truncated for demonstration. 

ACCESSION ABO031069 




















VERSION ABO31069.1 GI:8100074 
KEYWORDS . 
SOURCE Homo sapiens embryo male lung fibroblast cell _line:HuS-L12 cDNA 
Lo 
mRNA. 


ORGANISM Homo sapiens 
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE 1 (sites) 
AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and 
Takano,T. 

TITLE PCCX1, a novel DNA-binding protein with PHD finger and CXXC 
domain, 
































is regulated by proteolysis 
JOURNAL Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000) 
MEDLINE ZOU ZGEL2A56 

REFERENCE 2 (bases 1 to 2487) 
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AUTHORS 


TITLE 
JOURNAL 





of 





FEATURES 
SOure 











gene 


CDS 


CXXC 


Fug ine, T. 
Takano,T. 
Direct Su 
Submitted 
Tadahiro 


Microbiol 
(E-mail:f 





, Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. 
bmission 
(15-AUG-1999) to the DDBJ/EMBL/GenBank databases. 


Fujino, Keio University School of Medicine, 


and 








ogy; Shinanomachi 35, Shinjuku-ku, 
ujino@microb.med.keio.ac.jp, 


Tokyo 1Le0=38582, 





Tel:+81-3 


iS 


-3353-1211 (ex.62692), Fax:+81-3-5360-1508) 
Location/Qualifiers 

1..2487 

/organism="Homo sapiens" 
/db_xref="taxon:9606" 

/sex="male" 

/cell line="HusS-L12" 

/cell_type="lung fibroblast" 
/dev_stage="embryo" 

229..2199 

/gene="PCCX1" 

229..2199 

/gene="PCCX1" 

/note="a nuclear protein carrying a PHD finger and 


domain" 

/codon_start=1 

/product="protein containing CXXC domain 1" 
/protein_id="BAA96307.1" 
/do_xref="G1I:8100075" 





/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD 


NCNEWFHGDC 





RDEGGGRKRP 





QIKRSARMCG 


FPSSLSPVTP 





EPLSDED iP 1 





IRITEKMAKAT 





'VPDPDLORRAG 














ECEACRRTEDC 








SESLPRPRRPL 





DPD 














KKSEKKKEER 














DDCGMKLAAN 








FHELEAI TLRAKOQQAVREDEE 











YESOTSFGSM 





YKRHROKOKHK 





RIYEILPORIOQ 


1YODFCAGAFDDHGLPWMSDTE 























Et 


REWYCRECREKDPK 








EIRYRHKKSRERDGNERDSSEP 

















SGTGVGAMLARGSAS PHKSS POPLVAT PSOQHHOQOQOO 








GHC DFCRDMKKFGGPNKIROKCR JRARESYRKY 





1ROCO 











PTQQOPOPSOKLGRIREDEGAVAS STVKEPPEATATP 





























ESPFLDPALRKRAVKVKHVKRR 


(ea 








DKWKHPERADAKDPASLPOCLGPGCVRPAQOPSSKYCS 























QWQOS PC TAEEHGKKLLERIRREQQSARTRLOEMERR 




















[7] 


SNEGDSDDTDLOTFCVSCGHPINPRVALRHMERCYAK 











YPTRIEGATRL 

















Ea 


FCDVYNPOSKTYCKRLOVLC PEHSRDPKVPADEVCGC 
































PLVRDVFELTGDFCRLPKROCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT 








BASE COUNT 
ORIGIN 








61 
lial 
131 
241 
// 


Ta TT HT 
Ta TT 
Ta TT 


564 a 


agatggcggc 
agacgacgca 
gaagtagttg 
egeqggtcgc 
aaaaaaaaaa 


aa ET 
aa aE aE 
aa EEE 

















AMTNRAGLLALM 
der ae 





HOTIQHDPLTTDLRSSADR" 
168 g 440 t 











cacctactgg 
gggaggcgtg 
ccgagtggtc 
gcggagatat 


cttgggggct ctaggccggc 
gcaataggag tacgctgcct 
tCgeaacegce tgggacgecg 
cgtgagggag tgcgccggga 
aaaaaaa 


gctgaggggt 
tggggcctgc 
EgggeGeett 


tggcgggggt 
aaaaaaaaaa 





Department 


Japan 


ebigeagegg 
actagaagcg 
tgtgeaggtt 
ggagggagat 
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The fil 


is recor 





nam 


d.raw 

















The date of the file is Sun Mar 30 14:30:39 2003 

The: filetype of the file 1s raw 

The contents of the file are: 
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGT TTGCAGCGGAGACGACGCATGGGGCC 
TGCGCAAT 
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCC 
GAGTGGTC 
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGT GCGCCGGGAGCGGAGATAT GGAGGGAGATAAAAAAA 
AAAAAAAA 

AAAAAAAAAAAA 

HHH EEE EH HEHE HE EE HHH 

HHH EE EH HH EH HE EE HEF 

HoH HEE EE EH HH EH HE EE HHH 

The file name is record.embl 

The date of the file is Sun Mar 30 14:31:23 2003 

The filetype of the file is _embl 

The contents of the file are: 

ID AB031069 2487 bp mRNA PRI 27-MAY-2000 Sequence 
severely 

truncated for demonstration. AB031069 ABO31069 

XX 

SQ sequence 267 BP; 65 A; 54 C; 106 G; 42 T; O other; 

AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG 
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG 
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT 
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT 
AAAAAAAAAA AAAAAAAAAA AAAAAAA 

// 

HoH HEHEHE EH HE EH HE EE EHF 

HoH HEH EE EH HE EE HE EE HHH 

HEE HEE EH HH EH HE EE EHF 

The file name is record.fasta 

The date of the file is Sun Mar 30 14:31:40 2003 

The filetype of the file is fasta 

The contents of the file are: 

> ABO31069 2487 bp mRNA PRI 27-MAY-2000 Sequence 
severely 


truncated for demonstration. ABO31069 
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG 
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT 
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC 
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT 
CGTGAGGGAGTGCGCCGGGAGC GGAGATAT GGAGGGAGA TAAAAAAAAAA 
AAAAAAAAAAAAAAAAA 





aaa aE aE aE ea a 
Hae aE aE ae aE Ta aE aE aa 
aaa aE ae aE aE aE a 


The file name is record.gcg 





The date of the file is Sun Mar 30 14:31:47 2003 
The filetype of the file is _gcg 
The contents of the file are: 





ABO031069 2487 bp mRNA PRI 27-MAY-2000 Sequence 
severely 
truncated for demonstration. ABO031069 

ABO031069 Length: 267 (today) Check: 4285 


1 AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG 
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51 TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT 
101 GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC 
151 TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT 
201 CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA 
251 AAAAAAAAAA AAAAAAA 


Haat ae aa aaa EE 
aaa aE aaa EE a 
Hara aE ae ae aE aaa a 





The file name is record.staden 


The date of the file is Sun Mar 30 14:32:01 2003 


The filetype of the fi 
The contents of the fi 
}<====AB0Z10G69===s=== > 


le is _staden 
le are: 


AGATGGCGGCGCT GAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGG 
AGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCG 
GAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTT 
CGCGGGTCGCT GGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT 
AAAAAAAAAAAAAAAAAAAAAAAAAAA 


aT aT TT ET 
aT a aaa a aT a 
aaa a aT aa Ta 
tae aT at ae ae ae at ae aE aE ae a 


Test file format ref 














HEEEEEE EH EH EH EH EH EEE 
At this point there ar 
HEHE SH 


Ftitee Testing put met 
HEHEHE HF 





Printing staden data i 





AGATGGCGGCGCT GAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCC 


TGCGCAAT 


AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGT TGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCC 


GAGTGGTC 


TGTGCAGGTTCGCGGGTCGCTGGCGGGGGT CGTGAGGGAGT GC GCCGGGAGCGGAGATAT GGAGGGAGATAAAAAAA 


AAAAAAAAA 

AAAAAAAAAAA 

Printing staden data i 
ID AB031069 ABO031069 
XX 

SQ sequence 267 BP; 

AGATGGCGGC GCTGAGGGGT 

AGACGACGCA TGGGGCCTGC 

GAAGTAGTTG TGGGCGCCTT 

CGCGGGTCGC TGGCGGGGGT 

AAAAAAAAAA AAAAAAAAAA 
// 








Printing staden data i 
> AB031069 


HEHEHE EE EE EE EE EE HEHE 
ormatting and writing 
HERE EE EERE EE EE EE HEHE 


e 8 objects. 


hods 


n raw format: 


n embl format: 


65 A; 54 C; 106 G; 42 
CTTGGGGGCT CTAGGCCGGC 
GCAATAGGAG TACGCTGCCT 
TGCAACCGCC TGGGACGCCG 
CGTGAGGGAG TGCGCCGGGA 
AAAAAAA 


n fasta format: 


T; O other; 

CACCTACTGG TTTGCAGCGG 
GGGAGGCGTG ACTAGAAGCG 
CCGAGTGGTC TGTGCAGGTT 
GCGGAGATAT GGAGGGAGAT 


AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG 
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT 
GGGAGGCGTGACTAGAAGCGGAAGTAGT TGTGGGCGCCTTTGCAACCGCC 
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT 
CGTGAGGGAGT GC GCCGGGAGCGGAGATAT GGAGGGAGATAAAAAAAAAA 
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AAAAAAAAAAAAAAAAA 


Printing staden data 


ABO3 


Pein 
LOCU 


1069 
ABO31 
1 
on 
101 
151. 
201 
251 


ting 
S 


069 Length: 
AGATGGCGGC 
TTTGCAGCGG 
GGGAGGCGTG 
TGGGACGCCG 
CGTGAGGGAG 
AAAAAAAAAA 


staden data 
ABO31069 


DEFINITION ABO31069 








ACCE 
ORIG 


SS. 
Py 


U 
kK 
= 
2} 


Pa 
HW 
7) 
K 


SSION 
IN 
1 
61 
121. 
181 
241 








OHEt 

DoH 

HH 
He 




















nun wD 

lh 1G 
Ss 
a 
Ps) 
K 

















4-— 
oy 


BP 


31 C 
61 A 
91, oF 
12d. 1G 
Lod. 
Lei, iC 
241, 1 
241 A 


agatggcggc 
agacgacgca 
gaagtagttg 
egcegggtcge 
aaaaaaaaaa 


staden 


in gcg format: 


267 (today) 
GCTGAGGGGT 
AGACGACGCA 
ACTAGAAGCG 
CCGAGTGGTC 
TGCGCCGGGA 
AAAAAAA 


in genbank 


Check: 
CTTGGGGGCT 
TGGGGCCTGC 
GAAGTAGTTG 
TGTGCAGGTT 
GCGGAGATAT 


format: 


267 (610 


, 267 bases, 


gctgaggggt 
tggggcctge 
EgggegeEctt 


tggcgggggt 
aaaaaaaaaa 


AB031069 
ABO031069 


#Molecular-weight 0 


PANDA, rPAHA 
PAAHNPTAY, YY 
PADADAAADAAH 
PANDPHAAARAAH 
PAMDAPRHTPrPADNA 
PAHAARAAAAA 
PAAAHAAAA 
PAMDAHAAARA 


NEXT > 


i 


PrPaAADHPrRrAAAG 
POQHAHAHAHADA 
PADRAADAAAPrA 
PAAAAAADAH 
PAQAQAPrPAPrKAANA 


829 sum. 


eltigqgqgggcE 
gcaataggag 
tgcaaccgoce 
cgtgagggag 
aaaaaaa 


data in PIR format: 


PPADAANAADAHTPrHYA 
PAMDHAAA,rA 
PPANRAAANAA 
PHADAMNDHHDA 
PrPaAaAHAHHAADA 


#Length 267 


4285 


CTAGGCCGGC 
GCAATAGGAG 
TGGGCGCCTT 
CGCGGGTCGC 
GGAGGGAGAT 


evaggcegge 
tacgetgcect 
tgggacgccg 
tgcgccggga 


2 


PHHAHDAADHO 
PAQAHHPAHA 
PADAMDMAAHH 
PPrPHAHAHOQH PHA 
PANAAY,rY,rrAA 


#Checksum 


CACCTACTGG 
TACGCTGCCT 
TGCAACCGCC 
TGGCGGGGGT 
AAAAAAAAAA 


cacetactg¢G 
gggaggcgtg 
ccoagtggtc 
gcggagatat 


4285 


2 


QP? © Geo @ a 

PaANAPrALY,rYrrTrA 

PrPaAMDAPYrAAN 
QAAADAANDAAA 


PPHaAAPrAAN 


Chigeageg¢G 
actagaagcg 
EgtgCcagGgtt 
ggagggagat 


SOOO @ Ho 
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4.4 Resources 


e Inheritance is a fundamental OO technique; see Section 3.14 for Perl OO reference 
material that includes discussion of inheritance. 


e For programming with sequence data file formats, see the C program readseq by Don 
Gilbert (at http://iobio.bio.indiana.edu/soft/molbio/readseq). 





e See the Bioperl project at http://www.bioperl.org for alternate ways to handle this 
programming task in Perl. 





e For amore rigorous but slower approach to parsing sequence files (which is 
sometimes what you want) see the module Parse: :RecDescent by Damien Conway 
at CPAN. 


e Each sequence file format has documentation that describes it. These formats 
sometimes change to keep up with the changing nature of biological data. One of the 
following exercises challenges you to find the documentation for one of the formats 
and to improve the code in this chapter for that format. 


[ Team LiB ] 
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4.5 Exercises 


Exercise 4.1 


Write an object-oriented module DNAsequence whose object has one attribute, a 
sequence of DNA, and two methods, get_dna and set_dna. Start with the code for 
Gene.pm, but see how far you can whittle it down to the minimum amount of code 
necessary to implement this new class. 


Exercise 4.2 


The FileIO.pm module implements objects that read and write file data. However, 
they can, depending on the program, deviate substantially from what are actually 
present in files on your computer. For instance, you can read in all the files in a folder, 
and then change the filenames and data of all the objects, without writing them out. Is 
this a good thing or a bad thing? 


Exercise 4.3 


In the text, you are asked why the new constructor for FileIO.pm has been whittled 
down to the bare bones. You can see that all it does is create an empty object. What 
functionality has been moved out of the new constructor and into the read and write 
methods? Does it make more sense to do without a new constructor entirely and 
instead have the read and write methods create objects? Try rewriting the code that 
way. Alternately, does it make sense to try rewriting the code so that both reading 
and writing are handled by the new constructor? Is creating an object sometimes 
logically distinct from initializing it? 


Exercise 4.4 


Use FileIO.pm as a base class for a new class that manages the annotation of a 
pipeline in your laboratory. For example, perhaps your lab gets sequence from your 
ABI machine, screens it for vectors, assesses the quality of the sequencing run, 
searches your local database to determine if you've seen it or something like it before, 
then searches GenBank to see what other known sequences it matches or resembles, 
and finally adds it to an assembly project. Each step has a person or persons, a 
timestamp for the beginning and ending of each phase, and data. You want to be able 
to track the work done on each sequence that emerges from your ABI. (This is just an 
example. Pick a set of jobs that you actually do in your lab.) 


Exercise 4.5 


For each sequence file format handled by the SeqFileIO.pm module, find the 
documentation that specifies the format. Compare the documentation with the is _, 
parse_,andput_ method to recognize, read, and write files in each format. How can 
you improve this code? Make it more complete? Faster? 


Exercise 4.6 


My parse_ methods are somewhat ad hoc. They don't really parse the whole file 
according to the definition of the format. They just extract the sequence and a small 
amount of annotation. Take one of the formats and write a more complete parser for 
it. What are the advantages and disadvantages of a simple versus a more complete 
parser in this code? How about for other applications you may want to develop in the 
future? 


Exercise 4.7 


Use the parser you developed in Exercise 4.6 to do a more complete job of identifying 
a file in the same format in the module's is_ method. 


Exercise 4.8 
Add a new sequence file format to SeqFilelIo. 
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Exercise 4.9 


In FileIO.pm, and in many other places in this book, the program calls croak and 
exits when a problem arises (such as when unsuccessfully attempting to open a file for 
reading). Such drastic measures are sometimes desirable; for example, you may want 
to kill the program if a security problem is discovered in which someone is attempting 
to read a forbidden file. Or, when developing software, you may like your program to 
print an informative message and die when a problem occurs, as that might help you 
develop the program faster. 


However, very often what you really want is for the program to notice the error and 
take some appropriate steps, not simply die. If a file cannot be opened, it may be 
something as simple as the user of the program mistyping the filename, and what 
you'd like is to give the user another couple of chances to type the name in correctly. 
Rewrite FileIO.pm without calling croak. This may entail checking for the success or 
failure of certain operations and taking reasonable actions on failure. Should the class 
module take all such actions, or should the program that uses the class module be 
expected to behave appropriately when a failure is reported? 


Exercise 4.10 


The AUTOLOAD method in FileIO.pm tests for attributes that are scalars and 
references to arrays. The need for this comes from the list of attributes given in the 

% my attribute properties hash. Each attribute hash value is an anonymous array 
with two elements: default value and properties. From the default value you can see 
that a value is either a scalar (a string in this case) or an anonymous array (a 
reference to an array). The code that AUTOLOAD installs for accessor routines then 
checks if the attribute is either a scalar or a reference to an array. 


This AUTOLOAD method is inherited by SeqFileIO.pm. One of the modifications that 
SeqFileIO.pm makes is defining its own % my attribute properties to handle the 
new attributes that it defines, such as sequence. In this case, all the attributes are 
either scalars or references to arrays, as before. What modifications are necessary if 
some other data type is needed for a new attribute by a class that inherited 
FileIO.pm? How can you rewrite FileIO.pm to make it easier to write classes that 
inherit it? 


Exercise 4.11 


The test program testSeqFileIo has certain shortcomings. For one thing, it repeats 
blocks of code that can be replaced with a short loop (with a little rewriting). Another 
problem is that it doesn't test everything in the class. 


Rewrite testSeqFileIO so that it's clearer and more comprehensive. By default, 
make it just give a short summary of the number of tests performed and the number 
of tests passed, but add a verbose flag so that it prints out all its tests in detail when 
desired. The module SeqFileIO.pm is lacking POD documentation.Add POD 
documentation to the module that is fairly easily cut and pasted into a test program 
for the module. 


Exercise 4.12 


In SeqFileIO.pm, the hash % all attribute properties changed from the base 
class and needed to be redefined. However, the code for the all attributes, 
_attribute default, and permissions helper methods didn't change. Why then did 
the new class SeqFileIo redefine these methods? (Hint: are these helper methods 
closures?) SeqFilelIO.pm is also lacking POD documentation. Try adding POD 
documentation to the module soy that it can be easily cut and pasted into a test 
program for the module. 


Exercise 4.13 


The h2xs program that ships with Perl simplifies module creation, and even helps you 
create the Makefile.PL that you'll need to add your own module to CPAN or to your 
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local installation (which helps you bypass the somewhat awkward use lib directive 
that appears in the programs in this book). See also the perlxstut, the 

ExtUtils: :MakeMaker, and the AUTOLOAD manpages. In particular, see the -X option 
to h2xs. Write a module starting from the use of h2xs. 





Exercise 4.14 


The open calls in the read methods of the classes in this chapter specify a filehandle 
FileIOFH. Alternatives include using lexical scalars as filehandles or the 10: : Handle 
package. Rewrite the read methods so files are opened with these alternative types of 
filehandles. What costs or benefits result from these rewritings? (See the perlopentut 
part of the Perl documentation.) 


Exercise 4.15 


In the AUTOLOAD method, a copy of the file data is returned from the get_filedata 
accessor; this will protect the actual file data in the object, but it makes a copy of a 
potentially very large amount of data, which can overtax your system. Discuss 
alternatives for this behavior, and implement one of them. 


Exercise 4.16 


caAT 
fey) 


Reading in a few hundred large files (as can easily happen with the modules in this 
chapter) can overtax your system, causing the system, or at least the program, to 
crash. Design two alternative methods that avoid this overuse of memory. For 
instance, you can avoid reading in a file until the sequence data is actually needed. You 
can also reread the data into the program each time needed but not save it in your 
object. Finally, you can reclaim memory from older files. Implement one of these 
methods or some other. What other parts of the code need to be altered? 


4 PREVIOUS 
NEXT > 
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Chapter 5. A Class for Restriction 
Enzymes 


In this chapter, you'll learn how to write an object-oriented class that handles restriction 
enzymes using the modules from the previous chapter as part of an interface to the 
Restriction Enzyme Database (Rebase). I'll develop a class that finds restriction sites in DNA 
sequence data. In my book Beginning Per! for Bioinformatics, I presented code that extracts 
information from Rebase and uses it to make restriction maps of DNA sequence data. In this 
chapter, I'll adapt and extend that software (or ideas from the accompanying exercises) in an 
object-oriented fashion. (All code is shown here and is available from this book's web site.) 
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5.1 Envisioning an Object 


The Rebase project provides a set of files that specify restriction enzymes, their cut sites, and 
a great deal more information. Consider the problem of designing an object-oriented version 
of code that uses this data. What will be the objects and the methods? 


Each restriction enzyme has a name; associated with its name are the definition of its 
recognition site (which I'll translate into a Perl regular expression), information about the 
chemistry of the restriction enzyme, vendors of the enzyme, and other annotation. This 
information is all part of the Rebase database. 


Perhaps I should consider each restriction enzyme as a suitable candidate for my basic object. 
I can then read in the Rebase database, creating objects for each restriction enzyme that 
includes such attributes as the recognition site, the translation of the recognition site into a 
Perl regular expression, and whatever additional annotation I find useful. 


With such objects, I can associate methods that take as their arguments sequence data and 
return the list of locations in which that particular enzyme has a recognition site in the 
sequence. Sounds good, let's start coding! 


But wait. What happens if, as is often the case, you want to find multiple restriction enzymes 
in a sequence and display the resulting map. With my design, you'd have to find the object 
associated with each restriction enzyme, pass it to the sequence, collect the locations, and 
then combine the individual lists of locations in order to display the map. This can be slow 
(finding the right objects, one for each restriction enzyme) and inconvenient (combining the 
output of the various methods from the various objects). 


You recognize this questioning as an essential step in program designthinking about the 
problem and considering alternative ways to write code that solve it. I reprise the idea here 
because, so far, I've been simply seeing and discussing solutions. Although it's neat and tidy, it 
isn't really the way programming works. Programming often involves thinking of alternative 
program strategies, comparing them, coding the most promising alternatives as prototypes 
and testing them (i.e., benchmarking), and finally deciding on an approach to implement. 


So, in that spirit, what alternatives come to mind to the one enzyme/one object approach 
just described? The Rebase database is essentially a key/value lookup database, in which the 
key is the enzyme name. The value is the recognition site or annotation: actually there are 
several datafiles provided in the database. But I'm most interested in getting the recognition 
site, translating it to a Perl regular expression, and reporting on the locations in some 
sequence data. A nice interface to display some of the annotation of the restriction enzyme 
would also be useful. 


Any key/value type of data immediately brings the hash data structure to the mind of the Perl 
programmer. As you know from my introduction to object-oriented programming, the hash 
data structure is also the most useful way to implement an object. 


So, perhaps instead of many objects, one for each restriction enzyme, you may want to 
consider one object that provides the fast lookup of a value (the recognition site and regular 
expression) for each key (the name of the restriction enzyme). Clearly, this can be 
implemented as a hash. Other attributes can hold the sequence and the map as an array of 
the positions in the sequence in which the recognition sites exist. Methods for the object could 
extract the site, the regular expression, and perhaps some annotation, for each enzyme. A 
method can also locate the recognition sites for an enzyme in the sequence. 


If we go that way, how will we manage the actual restriction maps that are made? A 
restriction map has as input some sequence and a list of restriction enzymes, and has as 
output a list of the locations where the enzymes have recognition sites in the sequence. 
Should there be another kind of object, a Restriction object, that has attributes of 
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sequence, enzyme names, and locations of recognition sites? 


Perhaps we can use the SeqFileIo class from Chapter 4 as a base class for a new derived 
class that adds attributes for restriction maps on the sequence. 


That might be possible, but it combines file manipulations with restriction mapping and seems, 
at best, a shotgun wedding. 


So, after careful reflection, consultation with colleagues, a lab meeting, pressure from the PI, 
an opinion from an outside expert, and some quick and dirty Perl scripts to see some 
alternatives in action, a decision is reached. We'll make a big Rebase object to hold the 
enzyme/recognition site data, plus a new Restriction object that holds the sequence and 
the locations of the recognition sites (the "map"). The class will provide the methods needed 
to calculate a restriction map. 


One of the considerations that led to this decision was that, at some point, it will be 
necessary to graphically display the restriction map; an object that contains the sequence and 
the map (the locations of the recognition sites in the sequence) will be well suited for adding 
some graphics capabilities. 


Finally, in this chapter we'll use the Restriction class as a base class to develop a 
Restrictionmap class that does have some graphics capabilities. 


For more discussion of how to design the component parts of this software development 
project, see the exercises at the end of the chapter. 
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5.2 Rebase.pm: A Class Module 


Here is a very simple interface to the Rebase data contained in the bionet file that is part of 
its distribution: 


package Rebase; 


n 


# A simple class to provide access to restriction enzyme data from Rebase 
# including regular expression translations of recognition sites 


use strict; 
use warnings; 
use Carp; 

use DB File; 





Class data and methods 


# A hash of all attributes with default values 


fo) 


my % attributes = ( 





_rebase = of hy 
# key = restriction enzyme name 
# value = space-separated string of sites => regular 
expressions 
_bionetfiile => 2", 
_dbmfile => '??', 
_mode => 0444, 


Ve 


# Return a list of all attributes 
sub _all_ attributes { 
keys % attributes; 


} 


# Return the value of an attribute 
sub attribute value { 
my ($self,$attribute) = @ ; 
$_attributes{Sattribute}; 


} 


Notice that the opening block is considerably pared down, compared to earlier classes. For 
instance, I've tossed the code that keeps count of all objects. Why? Because it's unlikely that 
more than one of these objects will be necessary in a program: so why bother? 


5.2.1 Attributes: Short and Sweet 


Notice that the list of attributes is short: 

_rebase 
A hash that will be populated to provide the lookup, with enzyme names for keys, and 
recognition sites (and their translation to regular expressions) for values. (Make sure 


you see how in the hash %_ attributes the value of the key _rebase is itself an 
anonymous hash.) 


_bionetfile 


The name of the datafile from the Rebase distribution. In my examples, I use the 
version numbered bionet.212, and by the time you read this book, more recent 
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versions will be available (you can get bionet.212 from this book's web site). 
_dbmfile 


The DBM filename that resides on disk and stores the data in the hash _rebase.™ 


[1] 


so can help you avoid the cost of recalculating a hash each time a program is run. For more 


information, see O'Reilly's Programming Perl, the documentation for the DB_ File module, and the 


documentation for the dbmopen and tie functions. 


_mode 


Contains the permissions with which you will create, or attempt to read, the DBM file. 


This is important for security purposes. 


With so few attributes, the class methods can easily handle each attribute individually, without 


recourse to the use of AUTOLOAD to define various accessors and mutators, as seen in 
previous chapters. 


5.2.2 Creating a Rebase Object 


Here's how a Rebase object is created and initialized: 


# The constructor method 

# Called from class, e.g. $obj = Rebase->new( dbmfile => 'DBMFIL 
sub new { 

my (Sclass, Sarg) = @ ; 








1A 
~~ 
x 





# Create a new object 
my Sself = bless { }, Sclass; 





# DBM file must be given as "dbmfile" argument 
unless(Sarg{dbmfile}) { 
croak("No dbm file specified"); 





# Set the attributes for the provided arguments 
foreach my Sattribute (S$self->_all_attributes( )) { 








# E.g. attribute = "name", argument = "name" 
my(Sargument) = (Sattribute =~ /* (.*)/); 








# Initialize to defaults 
Sself->{Sattribute} = $self->_attribute value (Sattribute); 





# Override defaults with arguments 
if (exists Sarg{Sargument}) { 
if(Sargument eq 'rebase') { 
croak "Cannot set attribute rebase"; 





} 
Sself->{Sattribute} = Sarg{Sargument}; 





# Open or create the DBM file 


Recall that DBM files are tied to hashes in Perl and provide a simple, easy-to-program database for 
key/value pairs. They serve as a way to keep a hash on disk between invocations of a program, and 


unl 





Sself-> 








E 


ss(tie %{$self->{ rebase}}, 'DB File', Sarg{dbmfile}, O RDWR|O CREAT, 


{ mode}, SDB HASH) { 
my Spermissions = sprintf "slo", S$self->{_mode}; 
croak "Cannot open DBM file Sarg{dbmfile} with mode Spermissions"; 


} 





# If "bionetfile" argument given, calculate the hash from the bionet fil 
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if(Sarg{bionetfile}) { 


# Empty the hash 
s{Sself->{_rebase}} = ( ); 





# Recalculate the hash 
$self->parse rebase; 


} 


return Sself; 


} 





# For this simple class I have no AUTOLOAD or DESTROY 





# No get_rebase method, I don't want to pass around a huge hash 


# No "set" mutators: all initialization done by way of "new" constructor 





The new constructor method is short. It requires a DBM database filename (existing or new), 
to which the hash data structure _rebase is tied. If the DBM file doesn't exist, the pathname 
of the bionet Rebase file is also required (that's where the data comes from that populates 
the DBM file). If a bionet datafile is given, the method calls the parse _rebase method that 

parses the bionet file to create the rebase hash. 


As the comments indicate, my class is so simple I've even decided to do away with AUTOLOAD 
and DESTROY, and I've dispensed with the set mutators as well. 


5.2.3 Methods for the Rebase Class 


Now, let's continue by looking at the methods for the Rebase class. Given an enzyme, the 
following two methods, get_ recognition sites and get_regular_expressions, retrieve 
the enzyme's recognition sites and the translations of the recognition sites into regular 
expressions. 





How do these two methods work? One method returns all recognition sites for an enzyme as 
given in the Rebase database; the other returns all the translations of the recognition sites 
into regular expressions. They both work very similarly. First, the enzyme is looked up in the 
_rebase hash. The value for each enzyme in the _rebase hash is a space-separated string 
that alternates recognition sites with their regular-expression translations. In both methods, 
the space-separated string is split to get a list of alternating recognition sites and regular 
expressions. This list is then assigned to the hash sites to populate it with keys as 
recognition sites (the data that's actually in the Rebase bionet file) and values as regular 
expressions. The Perl operators keys and values are then used to generate the list of 
recognition sites (keys) or regular expressions (values). 


The get_bionetfile, get_dbmfile, and get_mode methods just report on the arguments 
that are set to specific filenames or mode strings when the object was created: 


sub get_regular_expressions { 
my(S$self, Senzyme) = @ ; 








my(ssites) = split(' ', $self->{_rebase}{Senzyme}) ; 
# May have duplicate values 


return values %sites; 


sub get_recognition sites { 
my(S$self, Senzyme) = @ ; 








my(ssites) = split(' ', Sself->{_rebase}{Senzyme}) ; 
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return keys %sites; 
sub get_bionetfile { 

my(S$self) = @ ; 

return Sself->{_ bionetfile)}; 
sub get_dbmfile { 

my ($self) = @ ; 

return $self->{_dbmfile}; 


sub get_mode { 
my(Sself) = @ ; 





return $self->{_ mode}; 


} 
5.2.4 parse_rebase 


The workhorse method of the class is parse_rebase, which reads the bionet Rebase datafile 
(with a suffix that indicates the release version, such as bionet.212). The bionet input 
datafile begins like this: 


REBASE version 212 bionet.212 





























REBASE, The Restriction Enzyme Database http://rebase.neb.com 














Copyright (c) Dr. Richard J. Roberts, 2002. All rights reserved. 
Rich Roberts Dec 01 2002 
Aaal (XmalIIl1) C*GGCCG 
AacI (BamHT) GGATCC 
AaelI (BamHT) GGATCC 
AagI (ClaT) AT*CGAT 
AaqiI (ApaLT) GTGCAC 
Agel CACCTGCNNNN* 

Aarl “NNNNNNNNGCAGGTG 
AasI (DrdtTI) GACNNNN*NNGTC 
AatI (Stull) AGG*CCT 

AatIl GACGT*C 

Aaul (Bsp1407T) T*GTACA 


As you can see, the header information ends with a line containing Rich Roberts. Apart from 
a blank line or two, the rest of the file contains records, one per line. Each record begins with 
a restriction enzyme name, optionally followed by another enzyme name in parentheses. The 
last field of each line is the recognition site. These are given using IUB codes for nucleotides. 
(For the IUB codes, see the comments in the program.) 


They also contain cut sites, indicated by a caret symbol *. Cut sites contribute very important 
information about a restriction enzyme; they show where the enzyme makes the break 
when it cuts the DNA. Among other things, they are needed to correctly perform restriction 
digests in the computer when determining if there are overhangs that will be useful when 
inserting vectors or otherwise reassembling the fragments. However, in the code here we'll 
ignore the cut sites taking as a goal the virtual fingerprinting of DNA by just locating the 
recognition sites. Cut sites are omitted to simplify the code. (See the exercises for more on 
handling cut sites.) 


sub parse rebase { 
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my($self) = @ ; 


# handles multiple definition lines for an enzyme name 
# also handles alternat nzyme names on a line 














# Read in the bionet(Rebase) fil 
unless (open (BIONETFH, S$self->{_bionetfile})) { 
croak ("Cannot open bionet file Sself->{_bionetfile}"); 





} 


while (<BIONETFH>) { 











my @names = ( ); 


# Discard header lines 


























next if (1 .. /Rich Roberts/ );# discard all lines from the first 
line 
# to the first line containing "Rich 
Roberts" 
# Discard blank lines 
next unless /\S/; # discard a line unless it contains something not 
# whitespace 
Split the two (or three if includes parenthesized name) fields 
my @fields = split; 
Get and store the recognition site 
my Ssite = pop @fields; 
For the purposes of this exercise, I'll ignore cut sites (%). 
This is not something you'd want to do in general, however! 
Ssite =~ s/\*//97 
Get and store the name and the recognition site. 
Add alternate (parenthesized) names 
from the middle field, if any 
foreach my Sname (@fields) { 
Sname =~ tr/)(//d; # delete parentheses 
push @names, Sname; 
} 
# Store the data into the hash, avoiding duplicates (ignoring * cut 
sites) 


# and ignoring reverse complements 

# Because these values are stored via DBM, I cannot use anything but 
# a scalar string to store the site/regularexpression pairs, 
space-separated 

(but s th xercises) 

foreach my Sname (@names) { 

















# Add new enzyme definition 
unless(exists Sself->{ rebase}{$name}) { 


Sself->{_rebase}{$name} = "Ssite " . IUB_to regexp (S$site) ; 
next; 

} 

my(sdefined_ sites) = split(' ', $self->{_rebase}{$name}) ; 





# Omit already defined sites 
if(exists S$defined_sites{$site}) { 
next; 
# Omit reverse complements of already defined sites 
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}elsif (exists S$defined_sites{revcomIUB(S$site)}) { 
next; 
# Add the additional site 
telse{ 
S$self->{_rebase}{$name} .= " Ssite " . IUB _to_regexp($site); 





} 
} 
} 


return 1; 


} 


This subroutine is a bit complex, corresponding to the nature of the data that it's processing. 
For instance, because enzymes can appear on more than one line, it has to check if an 
enzyme was already entered as a key in the hash. 


Let me remind you of the range operator that is used here to skip header lines: 


# Discard header lines 
next if (1... /Rich Roberts/ ); # discard all lines from the first line 
# to the first line containing "Rich 
Roberts" 


The expression (1 .. /Rich Roberts/ ) returns true (and leads to the line being skipped) 
only when the line being read is included in the range bordered by the first line and the first line 
containing the regular expression /Rich Roberts/. (See the perlop section of the Perl 
manual for all the details on the range operator.) 


The parse _rebase subroutine, after skipping the header and any blank lines, then processes 
each data line in a while loop. Each line is split into either two fields (name and recognition 
site) or three fields (name, parenthesized alternate name, and recognition site). The name or 
names are placed in the @names array and looped through. In the last foreach loop, if the 
enzyme name hasn't yet been defined in the DBM-tied hash, it is added as a key. The value 
assigned to the key is a string with recognition site followed by a translation of the recognition 
site to a regular expression. The program passes to the next name. 


If the enzyme name has previously been entered as a key, the previously entered recognition 
sites are examined, and if the new site is there, the program passes to the next name. 
Similarly, if the reverse complement of the site has been entered, the program passes to the 
next name. But otherwise (if the enzyme name was entered, but neither the site nor its 
reverse complement were in the list of sites for that enzyme), the recognition site is added 
with its translation to a regular expression. 


This method has to handle reverse complements of recognition sites. Many restriction 
enzymes are palindromic in the sense that their reverse complements are the same. (For 
instance, the reverse complement of GAATTC is GAATTC.) These biological palindromes 
indicate that the enzyme can cut the site on both strands. 


tes) Although the code presented here ignores cut sites, the exercises will ask 
you to reconsider them; note that if the cut site isn't exactly in the 
middle, there will be "sticky ends" of single stranded DNA that make it 


possible to anneal the fragment with a complementary sticky end. Refer 
to a standard molecular biology textbook for the essential biology of 
restriction enzymes. (The logic used here to handle reverse complements 
might not be ideal for all situations: see the exercises.) 





5.2.5 Methods to Translate Nucleotides to Regular Expressions 


Finally, the remaining methods translate IUB-coded nucleotide sequence data to Perl regular 
expressions. They also perform reverse complementation on IUB-coded sequence data. 


sub revcomIUB { 
my($seq) = @ ; 
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my Srevcom = reverse complementIUB(S$seq) ; 





return Srevcom; 


sub complementIUB { 
my($seq) = @ ; 


(my Scom = Sseq) =~ tr [ACGTRYMKSWBDHVNacgtrymkswbdhvn] 
[TGCAYRKMWSVHDBNtgcayrkmwsvhdbn] ; 





return Scom; 
Translate IUB ambiguity codes to regular expressions 
IUB to regexp 


A subroutine that, given a sequence with IUB ambiguity codes, 
utputs a translation with IUB codes changed to regular expressions 
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sub IUB_to_regexp { 
my(Siub) = @ ; 
my Sregular expression = ''; 


my Siub2character class = ( 


A => 'A', 
C => 'C', 
G=> 'G', 
T=> 'T', 
R => '[GA]', 
¥ => (Cr, 
M s> TPACT", 
K => '[(GT]', 
S => *[Gc]", 
y = Y Ae. ; 
B => '[(CGT]', 
D => '[{AGT]', 
H => '[ACT]', 
V => '[ACG]', 
N => '[ACGT]', 
); 
# Remove the * signs from the recognition sites 





Siub =~ s/\*//g; 


# Translate each character in the iub sequence 
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for (my Si = 0; Si < length(Siub) ; ++Si ) { 
$regular_ expression 
.= Siub2character class{substr(S$iub, $i, 1)}; 


} 


return S$regular_ expression; 


1; 


=headl Rebas 





Rebase: A simple interface to recognition sites and translations of them into 
regular expressions, from the Restriction Enzyme Database (Rebase) 








=headl Synopsis 
use Rebase; 


# Use "bionetfile" to create and populate dbm file 
my Srebase = Rebase->new ( 

domfile => 'BIONET', 

bionetfile => 'bionet.212', 

mode => 0644 








i 


# Use without "bionetfile" to attach to existing dbm file 
my Srebase = Rebase->new ( 

dbmfile => 'BIONET', 

mode => 0444 








\e 

















my Senzyme = 'EcoRI'; 

print "Looking up restriction enzyme Senzyme\n"; 

my @sites = Srebase->get_recognition_sites ($enzyme) ; 

print "Sites are @sites\n"; 

my @res = $rebase->get_regular_ expressions ($enzyme) ; 

print "Regular expressions are @res\n"; 

print "DBM file is ", $rebase->get_dbmfile, "\n"; 

print "Rebase bionet file is ", Srebase->get_bionetfile, "\n"; 
=headl AUTHOR 





James Tisdall 

=headl COPYRIGHT 

Copyright (c) 2003, James Tisdall 
=cut 


5.2.6 Testing the Module 


Ending the module, as usual, is some POD documentation for the module. Recall that you can 
view the output of this documentation in various ways, as HTML on a web page, as 
PostScript, etc. However, the simplest way is to say the following at the command line: 


perldoc Rebase.pm 


Let's try running the sample code given in the documentation. Notice that the Rebase.pm 
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module is available, as is the bionet.212 file from the Rebase distribution. Also, notice there 
are two alternate calls to Rebase->new, so you should comment out the first one, then the 
other, in tests. Save the sample code from the documentation in a file called testRebase, and 
when you run it with the command: 


perl testRebas 





you get the following output: 





Looking up restriction enzyme EcoRI 
Sites are GAATTC 

Regular expressions are GAATTC 

DBM file is BIONET 

bionet file is bionet.212 


4 PREVIOUS 
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5.3 Restriction.pm: Finding Recognition Sites 


The time has now come to write the class that creates an object out of a sequence, enzyme 
name(s), and the map of the location(s) of the enzyme recognition sites in the sequence. 


This module depends a great deal on the module Rebase developed in the previous section, 
but it is fairly short because it just tries to do a small job. This new Restriction class takes 

a Rebase object (which has the Rebase database translated to regular expressions), some 
sequence, and a list of enzymes, and uses the regular expressions to find the recognition sites 
in the sequence. (Note that it doesn't use inheritance; it simply creates a Rebase object to 
use.) 


In this module, the restriction map (the list of locations where the enzymes have recognition 
sites in the sequence) is obtained through an auxiliary method map_enzyme that simply lists 
the locations. Clearly, a more graphical display would be easier and more useful. I'll consider 
that as the book progresses. 


5.3.1 The Restriction.pm Module 


Here is the Restriction.pm module: 


package Restriction; 


A class to find locations of restriction enzyme recognition sites in 
DNA sequence data. 





ise strict; 
use warnings; 
use Carp; 





# Class data and methods 
{ 

# A list of all attributes with default values. 

# "enzyme" is given as an argument possibly multiple time, set as key to 
_map hash 
my % attributes = ( 

ia 




















ebase => { }, # A Rebase.pm hash-based object 
# key = restriction enzyme name 
# value = space-separated string of recognition sites => regular 
expressions 
_sequenc => '', # DNA sequence data in raw format (only bases) 
_map => { },# a hash: keys ar nzyme names, 
# values are arrays of locations 
_ enzyme => '', # space- or comma-separated enzyme names, 
# 


set as key to _map hash 
\; 


# Global variable to keep count of existing objects 
my $ count = 0; 


# Return a list of all attributes 
sub _all_ attributes { 


keys % attributes; 


} 


# Manage the count of existing objects 
sub get count { 
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$count; 
} 
sub (incr count { 
++5 count; 
} 
sub decr_ count { 
--$_ count; 





} 
} 


This opening block shows that Restriction is also a class with a small and simple set of 
attributes. One is just the DNA sequence data, sequence. 


The attribute map is a hash that stores the computed restriction map with a key for each 
restriction enzyme and a value consisting of an array of the positions in the sequence where 
recognition sites for that enzyme occur. 


The attribute enzyme is one or more enzyme names separated by spaces or commas. 


The remaining attribute rebase is an object of the class I developed in the last section, the 
class Rebase. Recall that a Rebase object gives you the ability to get regular expressions for 
recognition sites of restriction enzymes. When you call Restriction->new you must include 
as one of its arguments a Rebase object, as is shown in the example given in the POD 
documentation later in this section. 


Since a program can have many restriction maps, I also maintain the count of Restriction 
objects. 


5.3.1.1 Initializing Restriction objects 


Next, the new constructor method creates and initializes Restriction objects: 


# The "new" constructor method, called from class, e.g. 
sub new { 

my ($class, %args) = @ ; 

# Create a new object 

my Sself = bless { }, Sclass; 





# Set the attributes for the provided arguments 








foreach my Sattribute ($self->_all_attributes( )) { 
# E.g. attribute = " name", argument = "name" 
my (Sargument) = (Sattribute =~ /* (.*)/); 





if (exists Sargs{Sargument}) { 








if(Sargument eq 'enzyme') { 
# permit space or comma separated enzyme names 
Sargs{Sargument} =~ s/,/ /g; 


} 
Sself->{Sattribute} = Sargs{Sargument}; 





} 


# Check that the correct arguments are given 
if( not defined $self->{_rebase} ) { 
croak "A Rebase object must be given as an argument"; 





}elsif( ref ($self->{_rebase}) ne 'Rebase' ) { 
croak "The argument to rebase is not a Rebase object"; 
}elsif( not defined $self->{_ sequence} ) { 





croak "A sequence must be given as an argument"; 


} 


# Calculate the locations for each enzyme, store in map hash attribute 
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foreach my Senzyme (split(" ", Sself->{_enzyme})) { 
$self->map_enzyme (Senzyme) ; 





} 
$self->_incr count; 


return Sself; 





For this simple class I have no AUTOLOAD or DESTROY 





# No get_rebase method, I don't want to pass around a huge hash 


No set mutators: all initialization done by way of "new" constructor 








No clone method. Each sequence and set of enzymes can be easily calculated 
by means of a "new" command. 


5.3.1.2 The methods explained 





The new constructor method checks that the proper and required arguments are provided: a 
sequence, enzyme(s), and a Rebase object. 


Then the new constructor performs the required computation by calling the method 

map enzyme for each enzyme to determine the (possibly empty) array of locations in which 
the enzyme can cut the sequence. These enzyme-array key/value pairs are stored in the 
_map attribute. 


As you can see in the next segment of code from Restriction.pm, the method map_enzyme 
works by getting the regular expressions that have been calculated for the enzyme and then 
finding the positions where the regular expressions match the sequence. 


The get_ regular expressions method accesses the Rebase object that was passed to the 
Restriction->new constructor method and appears as the $self->{_rebase} attribute. In 
that object, the attribute _rebase is the hash of the database, which looks for the value of 
the key Senzyme. 


The match positions method puts a match of a regular expression in a while loop, where it 
loops once for each location it finds. The offset of the matching sequence is available in the 
special variable $- [0] after a successful regular-expression match (see the perlvar section 
of the Perl documentation for more details). 
sub map _enzyme { 

my(S$self, Senzyme) = @ ; 





my(@positions) = ( ); 





my(@res) = $self->get_regular_ expressions ($enzyme) ; 


foreach my Sre (@res) { 
push @positions, $self->match_positions ($re); 


} 


@{Sself->{_map}{Senzyme}} = @positions; 
return @positions; 


sub get_regular expressions { 
my(S$self, Senzyme) = @ ; 








my(ssites) = split(' ', Sself->{_rebase}{_rebase} {Senzyme}) ; 


# May have duplicate values 
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} 


return values %sites; 


# Find positions of a regular expression in the sequence 


sub 


} 


match positions { 
my ($self, Sregexp) = @ ; 


my @positions = ( ); 


# Determine positions of regular expression matches 
while ( $self->{_ sequence} =~ /Sregexp/ig ) { 
push @positions, ($-[0] + 1 ); 





} 


return (@positions) ; 


Finally, here are some helper methods that report on the attributes of sequence and of the 
calculated map for a given enzyme (or the entire hash that contains the map of all the 
enzymes). I didn't use AUTOLOAD here because there are only a handful of attributes that 
return a few different data types: array, scalar string, and hash. 


sub 


sub 


sub 


sub 


} 


get_enzyme map { 
my(Sself, Senzyme) = @ ; 





@{Sself->{_ map} {$Senzyme}}; 





get_enzyme names { 
my ($self) = @ 


id 


keys %{$self->{_map}}; 


get_sequence { 
my(S$self) = @ ; 


S$self->{ sequence}; 
get_map { 
my(S$self) = @ ; 


S{$self->{_map}}; 


5.3.1.3 Documentation 


Here's the (bare bones) documentation for the class embedded right in the module using the 


POD documentation language: 


=headl Restriction 


Restriction: Given a Rebase object, sequence, and list of restriction enzyme 





names, return the locations of the recognition sites in the sequence 


=headl Synopsis 


use Restriction; 
use Rebase; 


use strict; 
use warnings; 





my Srebase = Rebase->new ( 
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dbmfile => 'BIONET', 
bionetfile => 'bionet.212' 


i 








my Srestrict = Restriction->new ( 
rebase => Srebase, 
enzyme => 'EcoRI, HindIII', 





sequence => 'ACGAATTCCGGAATTCG', 


i 


print "2 


wns 





W ' 


Locations for EcoRI are ", join(' ', 








Srestrict->get_enzyme map(' 


=headl AUTHOR 


James Tisdall 


=headl COPYRIGHT 


Copyright (c) 2003, James Tisdall 


=cut 


il 


Saving the Synopsis example in a file and running it (making sure the required bionet.212 


file is in place) gives the output: 


EcoRI data 
Sequence is 





in Rebase is GAATTC GAATTC 
ACGAATTCCGGAATTCG 





Locations for EcoRI are 3 11 
4 PREVIOUS 





ECORI')), 
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5.4 Drawing Restriction Maps 


One of the most important lessons of scientific programming is the importance of a good 
display of a program's results. 


In this section, I'm going to add the ability to output a restriction map by inheriting 
Restriction.pm and making a new derived class Restrictionmap.pm. The restriction map 
will be shown very simply as the sequence printed in lines of simple text. The locations of 
restriction sites are written over the lines of text, giving the names of the restriction enzymes 
at the restriction sites. In Chapter 8, this simple text-based graphic output is replaced by a 
real picture (with colors, different fonts, and whatever bells and whistles you choose to add). 
I designed my base class Restriction.pm to represent the restriction map as a simple list of 
recognition site locations because I wanted my software to be flexible enough to be extended 
to accommodate any of the many different graphic formats that might be desired. 


The difference between an unreadable mass of data, and a good clean graphic display that 
leads the eye towards an interesting result, is profound. It's the difference between a 
successful program and a dud, between hours or days spent sorting through columns of data 
and a quick discovery of a region of interest. It's even the difference between a scientific 
discovery, and none at all. 


Graphics programming is a bit of an advanced topic. (You'll get your feet wet in Chapter 8.) 
But even if you need a program that's restricted to text output, you still need to spend 
programming time displaying that output in a useful manner. So, in this section, I'll add a very 
simple map drawing capability to the software, drawn as simple text. 


5.4.1 Storing Graphics Output in an Attribute 


I want to display a graphic. Do I need to add _ graphic and _graphictype attributes to my 
object? The question boils down to: shall I compute and display a graphic whenever needed, 
or shall I compute a graphic and store it in an attribute? If you're new to computer graphics, 
just think of a graphic as a mass of data, which can be stored in a variable, or in a file on disk, 
and can be used to generate a graphical display on the computer screen by the right 
software. 


I do already compute the restriction map, by which I mean the actual locations of the 
recognition sites for each enzyme requested, and stored it in the map attribute. I can also 
compute a graphic at the same time and store it in the proposed graphic attribute. 


Let's think about the pluses and minuses of storing graphics output in an attribute. 


I'm not sure which graphics system I'll use in the future; for now, a simple text output may 
work as a Stored attribute, but there are a lot of graphics outputs possible. Also, it seems 
likely that I'll be able to compute the graphics output very quickly, so the need for storing it is 
less compelling. And storing a large image for possibly hundreds or thousands of objects will 
be a strain on the computer system. 


However, I may not be able to calculate quickly for fancier, full color, high resolution graphics 
that I might want in the future. And perhaps I'll need to flip between graphics very quickly, in 
which case having them precalculated would be a necessity! 


Also, how about multiple digests? Should I just create one graphic for the entire set of 
enzymes in an object or add a single digest graphic for each enzyme? 


Maybe I can't answer all these questions yet, either. At this point, I'm not sure of the kind of 
graphics I'll be producing, or of their size, or of the time to generate them. Perhaps I can 
decide to make just one graphic per object (the user can always make a new object with just 
one enzyme if desired). 
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Answering these questions is a judgment call. When the answers to such questions aren't 
known, you should at least try to make the code flexible enough to accommodate whatever 
decisions are eventually reached. 


For now, let's add a graphics attribute and precalculate the graphics output. After all, if in the 
future I decide that, for instance, I can generate graphics quickly enough on the fly, but that 
they're a bit large, I'll just add a method that can generate the images when needed. 


5.4.2 The Restrictionmap Class 


Here's a simple extension of the Restriction class I'll call Restrictionmap, which adds a 
new _mapgraphics attribute. It doesn't compute the graphics output until the first time it's 
called; it then stores the result for future lookups: 


package Restrictionmap; 


use bas ( "Restriction" ); 





n 
# A class to find locations of restriction enzyme recognition sites in 
# DNA sequence data, and to display them. 


use strict; 
use warnings; 
use Carp? 





Class data and methods 


# A list of all attributes with default values. 


fo) 


my % attributes = ( 








# key = restriction enzyme name 

# value = space-separated string of recognition sites => regular 
expressions 

_rebase => { }, # A Rebase.pm object 

_ sequence => TN, # DNA sequence data in raw format (only 
bases) 

_enzyme => '', # Space separated string of one or more 
enzyme names 

_map => { }, # hash: enzyme names => arrays of locations 

_graphictype => 'text', # one of 'text' or 'png' or some other 

_graphic => '', # a graphic display of the restriction map 


)e 


# Return a list of all attributes 
sub _all_ attributes { 
keys % attributes; 
} 
} 


Notice that I've declared a new class Restrictionmap using the base class Restriction. 
Notice also that no AUTOLOAD mechanism is provided. 


Also, in the block containing the class data and methods, there's a new definition of the hash 
% attributes that adds the new attribute graphictype to indicate the type of graphic (at 
first, this will be "text"), and the new attribute graphic as a simple scalar. 


Why am I defining it as a scalar? As you'll see, the routines that make the text-based graphics 
that I'm adding here assemble the graphics output as an array. To store it as a scalar requires 
an extra call to the Perl join function. 


The reason I'm storing the graphics display as a scalar is that image data comes in a variety 
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of formats. If I plan on just saving an image as a scalar, it will perhaps cover the most 
common possibilities; it's always possible to read a file into a scalar in Perl. Hopefully, this 
decision to store the image data as a scalar gives the most flexibility for future development. 


This opening block is similar to the opening block of the base class Restriction. Notice the 
only helper method I'm redefining is the all attributes method, which uses the redefined 
% attributes data structure that appears in the same block with it. The other methods in 
this block in the base class count the number of objects in existence during the running of the 
program; those methods and the count variable they depend on are inherited by this new 
derived class Restrictionmap and will work fine. 


5.4.2.1 Adding graphics capability to the class 


Next, come the methods that add the graphics capability to the class: 


sub get graphic { 
my ($self) = @ ; 


# If the graphic is not stored, calculate and store it 
unless ($self->{_graphic}) { 


unless ($self->{_graphictype}) { 
croak ‘Attribute graphictype not set (default is "text")'; 
} 


# if graphictype is "xyz", method that makes the graphic is 
" drawmap xyz" 
my Sdrawmapfunctionname = " drawmap_" . Sself->{_graphictype}; 


# Calculate and store the graphic 
Sself->{ graphic} = $self->$drawmapfunctionname; 


} 


# Return the stored graphic 
return $self->{_ graphic}; 


} 


This is the public method that is meant to be called from a program. Recall that I decided to 
only calculate the graphics for an object the first time the graphic is requested; it's then saved 
in the mapgraphic attribute for subsequent calls. 


5.4.2.2 Creation of the graphic 


The first step in graphic creation is an initialization routine. This doesn't do much here, but 
other graphics I might create may well benefit from a separate initialization step, so I plan 
ahead and add it here as well. 


Each enzyme is called in turn, and its matching positions are retrieved as you've seen before 
with the get_enzyme_map method. The appropriate annotation is added to the @annotation 
array. The resulting annotation is then formatted for output and returned: 

# 


# Methods to output graphics in text format 
# 


sub _drawmap text { 
my($self) = @ ; 


my @annotation = (); 











push(@annotation, initialize annotation _text ($self->get_sequence) ); 
foreach my Senzyme (Sself->get_enzyme names) { 
_add_annotation _text( ° annotation, Senzyme, 
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S$self->get_enzyme map ($enzyme) ); 
} 


# Format the restriction map as sequence and annotation 
my @output = _formatmaptext(50, $self->get_sequence, @annotation) ; 





# Return output as a string, not an array of lines 
return join('', @output); 


} 


# Make a blank string of the same length as the given sequence string 
sub (initialize annotation text { 
my($seq) = @ ; 





return '' x length(S$seq) ; 


The idea behind this text-based annotation is as follows. The lines of sequence output are 
separated by blank lines just for readability. Directly above the sequence lines are additional 
annotation lines containing the names of enzymes, which appear with their names starting 
exactly where the recognition site starts. Sometimes, enzyme names collidetwo or more 
enzyme names might require the same space on an annotation line. In that case, an 
additional annotation line is created to print these overlapping enzyme names. There's an 
example of this in the program output following this discussion. 


The code is fairly well commented. Check out the exercises: one will challenge you to 
improve it or to start from scratch and invent a better way to display this kind of text-based 
annotation. I'll omit a detailed commentary on exactly how this code accomplishes its job; as 
they say, it will be left as an exercise for the reader. Or, as one of my mathematics 
professors used to delight in saying, "A moment's reflection should clear the matter up for 
you." 

# Add annotation to an annotation string 

sub _add_annotation_text { 


my(Sarray, Senz, @pos) = @; 
# Sarray is a reference to an array of annotations 
# Put the labels for th nzyme name at the correct positions in the 


annotation 
foreach my S$location (@pos) { 





# Loop through all the annotation strings as necessary 
for( my $i = 0; Si < @Sarray ; ++S$i ) { 


# If the annotation contains only space characters at that 


position, 
# insert the annotation 
if (substr(S$Sarray[$i], $S$location-1, length(Senz)) eq (' ' x 
length (Senz))) { 
substr($Sarray[$i], $location-1, length($enz)) = $enz; 
lasic; 


# If the annotation collides, add it to the next annotation 
string on the 

# next iteration of the "for" loop. 

# But first, if there is not another annotation string, make one 


}elsif (Si == (@Sarray - 1)) { 


push(@S$array, _initialize annotation text ($Sarray[0])); 


} 
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} 


# Sequence with annotation lines formatted for the page with line breaks 
sub _formatmaptext { 


my(S$line length, $seq, @annotation) = @ ; 


my(@output) = (); 


# Split strings into lines of $line length 
for ( my S$pos = 0 ; Spos < length($seq) ; $pos += $line length ) { 
# Print annotation on top of sequence, using reverse 
foreach my $string ( reverse (S$seq, @annotation) ) { 
# Discard blank lines? 
# if ( substr(S$string, $pos, $line length) !~ /[* \n]/ ) { 
# next; 
tf 





# Add line to output 

push(@output, substr($string, S$pos, $line length) . "\n"); 
} 
# separate the lines 
push (output, "\n"); 





} 


# Return the merged annotation and sequence 
return @output; 


} 
=headl Restrictionmap 
Restrictionmap: Given a Rebase object, sequence, and list of restriction 


enzyme 
names, return the locations of the recognition sites in the sequence 








=headl Synopsis 
use Restrictionmap; 
use Rebase; 


use strict; 
use warnings; 





my Srebase = Rebase->new ( 
domfile => 'BIONET', 
bionetfile => 'bionet.212' 





Vi 


my Srestrict = Restrictionmap-—>new ( 
rebase => Srebase, 
enzyme => 'EcoRI', 
enzyme => 'HindIII', 
sequence => 'ACGAATTCCGGAATTCG', 
graphictype => 'text', 








\e 





print "Locations are ", join ' ', Srestrict->get_enzyme map('EcoRI'), 
WA Ms 


print Srestrict->gét_graphic; 


=headl AUTHOR 
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James Tisdall 

=headl COPYRIGHT 

Copyright (c) 2003, James Tisdall 
=cut 


1; 
5.4.2.3 Running the program 


If you run the short program given in the Synopsis section of the documentation, you won't 


get very interesting output; the input data is too short. The following is a better example. 


There's some DNA sequence data that contains EcoRI and HindIII recognition sites, saved in a 


file called sampleecori.dna. The class SeqFilelIo is used to read it in, and then 
Restrictionmap makes and displays a restriction map of that sequence. 


Here's the program, an extension, really, of the short program that appears in the Synopsis 


section of the POD documentation. Let's use it to test this new Restrictionmap Class: 


use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 


use Restrictionmap; 
use Rebase; 
use SegqFileI0O; 


use strict; 
use warnings; 











3 


y Srebase = Rebase->new ( 
dobmfile => 'BIONET', 
bionetfile => 'bionet.212', 
mode => '0666', 

); 








my Srestrict = Restrictionmap-—>new ( 
rebase => Srebase, 
enzyme => 'EcoRI HindIII', # GAATTC # AAGCTT 





sequence => 'ACGAATTCCGGAATTCG', 
graphictype => 'text', 
Ve 





print "Locations are ", join ' ', Srestrict->get_enzyme_map('EcoRI'), 


print $réstrict->get_graphic; 





## Some bigger sequenc 





my Sbiggerseq = SeqFilelO->new; 
#Sbiggerseq->read(filename => 'map.fasta'); 
Sbiggerseq->read(filename => 'sampleecori.dna'); 








my Srestrict2 = Restrictionmap-—>new ( 
rebase => Srebase, 
enzyme => 'EcoRI HindIII', # GAATTC # AAGCTT 





sequence => Sbiggerseq->get_sequence, 
graphictype => 'text', 





); 


print "\nHere is the map of the bigger sequence:\n\n"; 





Wh: 
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print Srestrict2->get_graphic; 


Notice that a use lib directive is added to tell Perl where to find the modules; you can 
accomplish the same thing with the PERL5LIB environmental variable or with a command-line 


argument to Perl. 


Here's is the output from the test program: 


Locations are 3 11 
EcoRI EcoRI 
ACGAATTCCGGAATTCG 











Here is the map of the bigger sequenc 


AGATGGCGGCGCTGAGGGGTCT TGGGGGCTCTAGGCCGGCCACCTACTGG 


TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT 


EcoRI 
GGGAGGCGTGACTAGAAGCGGGAATTCAAGTAGTTGTGGGCGCCTTTGCA 





ACCGCCTGGGACGCCGCCGAGT GGTCTGTGCAGGTTCGCGGGTCGCTGGC 


HindIII 
GGGGGTCGTAAGCTTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAG 


ATGGTTCAGACCCAGAGCCTCCAGAT GCCGGGGAGGACAGCAAGTCCGAG 


AATGGGGAGAATGCGCCCATCTACTGCATCTGCCGCAAACCGGACATCAA 


AindlL 
CTGCTTCATGATCGGGTGTGACAAGCTTAACTGCAATGAGTGGTTCCATG 


GGGAC TGCATCCGGATCAGCGGGATGGCAATGAGCGGGACAGCAGTGAGC 


CCCGGGATGAGGGTGGAGGGCGCAAGAGGCCTGTCCCTGATCCAGACCTG 


CAGCGCCGGGCAGGGTCAGGGACAGGGGTTGGGGCCATGCTTGCTCGGGG 


EcoRI 
CTCTGCTTCGCCCCACAAATCCTCTCCGCAGCCCTTGGTGGCCACGAATT 





CACCCAGCCAGCATCACCAGCAGCAGCAGCAGCAGATCAAACGGTCAGCC 


EcoRI 
Hanae TT 
AAGCTTGAATTCCGCATGTGTGGTGAGTGTGAGGCACCAGTGACGCCCTC 





AGAGTCCCTGCCAAGGCCCCGCCGGCCACTGCCCACCCAACAGCAGCCAC 


EcoRI 
AGCCATCACAGAAGT TAGGGAATTCGCGCATCCGTGAAGATGAGGGGGCA 





EcoRI 
GTGGCGTCATCAACAGT CAAGGAGCCTCCTGGAATTCAGGCTACAGCCAC 





ACCTGAGCCACTCTCAGATGAGGACCTA 


As you see, the sequence is printed out formatted for the page with the names of restriction 


enzymes appearing above their recognition sites. 
[+ Previous] 
i 
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5.5 Resources 


The primary resource for this section is the Restriction Enzyme Database web site 
found at http://www.neb.com/rebase. 





See the discussion of making restriction maps in O'Reilly's Beginning Perl for 


Bioinformatics. 
Team LiB 
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5.6 Exercises 


Exercise 5.1 


Why use the object-oriented approach for the interface to the Rebase database at all? 
What are the benefits and detriments of going to the object-oriented style? 


Exercise 5.2 


The Restriction.pm module uses another module in a new way. Instead of inheriting 
the Rebase.pm class, it requires that a Rebase object be passed to the constructor 
Restriction->new to become one of the attributes of the Restriction object. 


Consider alternative ways to write this code. Can Restriction inherit Rebase and 
achieve the same functionality? If so, write the code. Or, can the same functionality be 
achieved by some method that avoids having a Rebase object passed as an argument 
to a Restriction object? If so, write the code. 


Exercise 5.3 


Go to CPAN and read the documentation about the MLDBM module. It allows you to 
use a DBM file to store and retrieve complex data. Rewrite the Rebase.pm module to 
use MLDBM and replace my use of space-separated strings of recognition sites and 
regular expressions. 


Exercise 5.4 


As discussed in the text, there are some interesting considerations involved in parsing 
the data that relates to how the restriction enzymes actually work, such as handling 
reverse complements of recognition sites and cut sites. The logic used here to handle 
reverse complements might not be ideal for all situations. Review carefully the logic of 
the parse _rebase subroutine. Can you find any problems its logic might cause when 
you try to use the software to support a particular experiment? 


Exercise 5.5 


It would be nice to be able to ask some method in Restriction.pm if a particular 
restriction enzyme produces sticky ends at its cut site. It would also be useful to know 
what other enzymes create sticky ends that will anneal with the sticky ends of this 
enzyme. Check to see if this information appears in any of the datafiles of the Rebase 
database. Can you design a method that returns this information, given the name of a 
restriction enzyme? What changes do you have to make to your database; do you 
need any more datafiles from the Rebase distribution? 


Exercise 5.6 


Describe in detail how the logic for map_enzyme works. Can you devise a different way 
to accomplish the same thing? 


Exercise 5.7 


The code in this chapter uses the class Restriction as a base class for the class 
Restrictionmap which lets you make a graphic display of the restriction map. Would it 
be a better idea just to add the graphics capabilities to the Restriction class instead 
of inheriting it into a new class? Rewrite Restriction to add the graphics capability to 
it. What are the pros and cons of these two different ways of writing and organizing 
the code? 


Exercise 5.8 


In the method formatrestrictionmap, some lines of code are commented out that 
shorten the output by not printing extra blank lines. Try it out both ways. (And may 
God have mercy on your souls.) Do you think it makes the output less lengthy at the 
expense of making it more difficult to read? What is the tradeoff here? Do you prefer 
the longer or shorter version? Defend your preference. 
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Exercise 5.9 


Add position numbers to the output of Restrictionmap. Add the position of the first 
base in each line or the position of each restriction enzyme. 


Exercise 5.10 


The drawmap_text method of the Restrictionmap Class is a bit lengthy and 
involved. See if you can improve the method. Either alter the code in the book or start 
from scratch. Improve it by making it faster, simpler, or easier to read. Try making its 
output better or add options to make the output more flexible. Try any combination of 
the above. 


Exercise 5.11 


String copying is a great way to slow down a program. Consider the code I gave for 
the following subroutine: 
sub complementIUB { 

my($seq) = @ ; 


(my Scom = Sseq) =~ tr [ACGTRYMKSWBDHVNacgtrymkswbdhvn] 
[TGCAYRKMWSVHDBNtgcayrkmwsvhdbn] ; 


return $com; 


} 


Explain why the subroutine is written in this somewhat slow way. Now, rewrite this 
subroutine to eliminate a string copy. (Extra challenge: there are actually two string 
copies here. Rewrite the subroutine another way to eliminate a string copy. Can you 
eliminate both string copies? Why or why not?) Also, what's with those square 
brackets around the arguments to the tr function? 


Exercise 5.12 
Consider the following lines from the subroutine IUB_ to regexp: 
# Remove the * signs from the recognition sites 
Siub =~ s/\*//g; 
This operation is redundant because the caret “ was removed from the recognition 
site in the subroutine parse _rebase. Why is it included here? 
Exercise 5.13 


Consider the following last two lines from the subroutine map_enzyme in 
Restriction.pm: 


@{Sself->{_ map} {Senzyme}} = @positions; 
return @positions; 


How does the subroutine behave differently if the first line is changed to: 
$self->{_map}{Senzyme} = \@positions; 


Why does the subroutine return the array @positions since the return value isn't used 
in any of the code and the positions are saved in the object anyway? 


Exercise 5.14 


There is a difference in behavior and readability between the looping constructs 
for(;;) and for( ) or its synonym foreach( ). Try writing some small test 
programs that use these different loops and time them using the Perl modules 
Benchmark or Devel: :DProf. Clearly, for and foreach are most useful when iterating 
through arrays, and for(;;) is most useful when iterating through numbers. 
However, there are places in the code presented in this chapter in which for(;;) 
iterates through an array using a scalar variable as a subscript counter (as $i in 
Sarray[$i].) Try finding and rewriting such loops using foreach; benchmark the two 
versions. 


4 PREVIOUS 
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Chapter 6. Perl and Relational Databases 


Relational database systems are extremely important in all kinds of computingcommercial as 
well as scientific. The Perl programmer can perform most database manipulations from Perl 
programs using modules written for this purpose. I'll briefly review database lore and then 
concentrate on an introduction to the Perl modules that provide an interface to relational 
databases. 


This and the remaining chapters of this book will continue to look at fundamental Perl topics 
but with this difference: these topics rely on Perl modules, not on new Perl syntax. The 
reason for this is a deliberate decision by the Perl language designers to keep the language 
itself fairly small and to move as much functionality as possible into modules. This decision is 
interesting and important, and it has the practical effect of making modules quite important in 
Perl. 


First, I'll provide a quick explanation of database terminology and acronyms. Since I'll be 
discussing only relational databases, I sometimes say database to mean relational database. 
(They aren't synonymous in general, however.) I say DBMS (or database management 
system(s)) to refer to the software that provides database capabilities, such as MySQL or 
Oracle. Database or relational database refers to the definition and implementation of a 
particular collection of data in a DBMS, such as the examples I show later in the chapter. 
These terms, database for the data itself, and database management system for the 
software system that handles the data, are often used informally and interchangeably, and it's 
usually clear what is meant. 
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6.1 One Perl, Many Databases 


There comes a time when disk files or the simple DBM hash database (that you've seen in 
previous chapters) just won't manage the data of a medium- or large-size project, and you 
must turn to relational databases. Although they take quite a bit more effort to set up and to 
program, they offer a standard and reliable way to store data and to ask questions about it. 


There are two things that make relational databases standard. For one thing, they all follow a 
certain model of data structures, the relational model. These data structures have become a 
fixture in the computing world; they combine a level of constraint and flexibility that has 
proved its usefulness in many areas, including bioinformatics. 


Almost all relational databases are programmed with a programming language called the 
Structured Query Language, or SQL. This is a fairly simple language that creates, populates, 
queries, and manages the kind of data structures relational databases provide. The 
combination of a standard data structure with a standard programming language is another 
reason relational databases have become so successful. 


One thing that's not standard is the proliferation of relational database companies and their 
penchant for doing things their own way. This may sometimes be a marketing decision, but 
it's more often the natural process of evolutionof different sets of programmers having 
different ideas and making different implementations. 


"1 When I first released software that used Perl for bioinformatics, I received a letter arguing that because C 


was available everywhere and Perl wasn't, Perl for bioinformatics was therefore a Bad Idea and I should use C 
instead. Of course, Perl is available everywhere now, including on the VMS systems that my correspondent 
was complaining about, and bioinformatics software is written in a variety of languages. He made the classic 
mistake of wanting to standardize a field long before it had settled down. 


This is important when you have some working database application that uses a particular 
DBMS such as Oracle, and you find that you have to port the application to work on another 
DBMS such as MySQL. Perhaps another database system has become significantly faster or 
cheaper, or your computer is replaced with a new one that supports a different database, or 
your computer center or CIO decrees that some new DBMS is now the mandatory standard. 
If your database application makes extensive use of a feature that is available only on your 
old DBMS, you'll have a lot of work ahead of you rewriting your software to make it work on 
the new DBMS. 


Luckily, thanks to some expert Perl programming, there is a way to get around this 
proliferation of different DBMS with their special ways of doing things and their special 
extensions of SQL. In this chapter, I'll use the Perl DBI (DataBase Independent) module that 
provides a common interface to different relational database systems; it makes it possible to 
write SQL that will run on many different relational database systems with little or no change. 


Still, unless you are subject to a decree, the problem that the Perl bioinformatics programmer 
faces at the beginning of a project is, "Which relational database system should I use?" It 
depends on the computer you're on and what DBMS is already in use, available, paid for, or 
known locally. There are very expensive systems, and there are free ones: we'll take a quick 
look at some of the alternatives and use one of the most popular free ones for the following 
examples. 


The beginning programmer should be aware that relational databases are a large field of 
endeavor. Stop at any local bookstore with a good computer book section and you'll see an 
impressive number of books dedicated to relational databases and SQL in general, and 
especially dedicated to working with specific relational database management systems such 
as Oracle, SQL Server (the database, not the language), Sybase, MySQL, etc. There are 
books devoted to specific tools for designing a database, managing a database system, and 
programming a database system. In the workplace, there are job titles and positions for 
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people who specialize in these three areas, and more. 


So, don't expect this one chapter to reveal all. Do expect it to explain the basic concepts, give 
you the lay of the land, and demonstrate a practical example you can use as a template as 
you begin to develop your own code. At the very least, you will want access to the 
documentation for the particular database system you will use in your own work. 


Having given the obligatory warning, I'll also add that, unless you are tackling a fairly large 
project, relational databases aren't all that difficult to use. If they were, they wouldn't be so 
popular. You may well spend a lot of your programming time in the future dealing with 
databases, or you may just spend a little. If you expect it will be a lot, I recommend you do 
some further reading. 


Relational databases are an important topic. Because they have their own software systems, 
language, and concepts, they are a bit of an additional challenge. Most bioinformaticians need 
to know the basics of how to use them; some specialize in them. This chapter will get you 


started. 
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6.2 Popular Relational Databases 


Relational database management systems are big business. They account for a significant 
segment of the computer business. Large companies often use them to manage their internal 
affairs and their sales data. Universities manage their internal affairs, and their research 
projects with them. Various governments use them extensively, from the national to the local 
levels. 


The industry leader at present for high-end systems is Oracle. Along with other DBMS vendors 
such as IBM, Microsoft, Sybase, Informix, and more, they provide large database systems 
that can handle large amounts of data, and many queries against that data, very quickly. 
They also provide design and management tools for the programming and support staff a 
large institution needs to maintain a database system. 


Unfortunately, these very nice software systems are also very expensive. They have hefty 
price tags themselves, and they require high-end computer systems to run on. 


From the top-of-the-line systems on down, there are many vendors and price ranges in the 
database marketplace. 


The main contenders in the free DBMS marketplace are mSQL, MySQL, and PostgresSQL. 
These systems are all progressing steadily, even rapidly, in their abilities. I use MySQL in this 
book, but the reasons aren't terribly important. The code I write in Perl and SQL runs with 
little need for change on any of these systems. If you're in the position of actually having to 
decide on an DBMS to install on your computer, you can find the information you need on the 
home web pages for these systems. 


MySQL is my DBMS of choice because it's free; suitable for small- to medium-size database 
projects; and runs on most operating systems found in the lab, such as Mac, Windows, and 
Unix/Linux. It lacks some features major systems have, but it has enough of them and is 
implemented well enough that many businesses and many research laboratories have found it 
quite suitable for their work. Approximately three million servers have it installed. On balance, 
it has very good performance.” 


a MySQL's multithreading helps account for its good performance, but it lacks some of the more advanced 


features of other DBMS available. Competition between these DBMS is keen, however, and there has been a 
certain amount of jockeying for bragging rights as performance and feature sets are improved. 


The details of MySQL are available, and you can get a free copy of it from 


http ://www.mysgl.org. 
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6.3 Relational Database Definitions 


A relational database is essentially a set of tables (relations) that are interrelated in certain 
ways. A table is a two-dimensional matrix with a name. Each table is composed of rows ( 
tuples), and no two rows can have exactly the same values. Each row is composed of named 
fields (attributes); each field has a certain data type and a name, and a field can only contain 
one value. 


'] vs the name "relation" suggests to the mathematically inclined, each table represents a subset of the 


Cartesian product of the domains of the fields, in which each row is an element of the relation. However, the 
order of the fields in a table is not significant (because each field has a unique name), and that's an important 
departure from the standard set-theoretic definition of a relation. 


For instance, a table may be defined with the fields Name as a character string of at most 50 
characters, an ID as an integer, and a Date as a special date datatype. Each row then has a 
Name, ID, and Date value. Table 6-1 and Table 6-2 show a table called genename with three 
fields and three rows, and one called organism with two fields and five rows. 





Table 6-1. genename 
Name ID 


aging 118 


wrinkle 9223 1987-08-15 
hairy 273 1990-09-30 


Table 6-2. organism 


1984-07-13 











human 118 
mouse 273 
worm 118 





You'll notice that the tables each have a field with the same set of values as the other; ID in 
Table 6-1 and Gene in Table 6-2 use values from the same domain. (A typical domain is the 
set of positive integers, for example.) Such shared fields can be used to join information from 
two or more tables. For instance, given a gene Name from the Table 6-1, I can find its 
associated ID, and look for that value in the Gene field of the Table 6-2 to find what organism 
or organisms have a version of that gene. 


In Table 6-1, the wrinkle gene has ID value 9223, and in Table 6-2, there are two entries in 
field Gene with value 9223, namely human and mouse. 


Each row in a relation is unique and can therefore serve as its own unique identifier for the 
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row. There may also be some smaller group of fields that, together, are unique for each row 
and for which removing any field destroys the uniqueness; such a group is a candidate key. 
Very often tables are defined so that each row has its own unique ID field (it may be called 
something besides ID) that alone may serve as a key; this usually is recommended. 


In any event, some candidate key is designated as the primary key for the table. 


If in another table there is a field with the same domain as the primary key, it can relate the 
information in the two tables. That field, in the other table, is called a foreign key, and the 
primary key's table is called the foreign key's home relation. In MySQL, foreign keys aren't 
specified as such; they are used only in joins, as you'll see. 


In defining fields, you may also specify that no row has a null (undefined) entry in a field; this 
is required for primary key fields. Similarly, a foreign key is required to refer to an actually 
existing value in some field in its home relation. These constraints on the data are known as 
entity integrity and referential integrity. 


Using some SQL statements, you can write a (very short) program that, given a gene name, 
returns the list of organisms in which a version of that gene is found (in this database). In 
fact, I'll do just that in the next section. 


That's simple enough. However, there is a bit more to learn about designing tables that work 
well, implementing the tables in SQL, and writing the (Perl) programs that send SQL 
statements to the DBMS and that compute and display the results. These areas of database 
design, implementation, and application development, are often staffed by specialists in large 
projects. Each specialization has its own techniques and lore; each can be a full-time job, and 
there are even more areas of specialization than these. Most bioinformatics programmers 
know enough about each area to design and implement a web-based, database-driven 
interface to their laboratory, which is the basic skill set that I'm attempting to impart. 
Assuming you are a beginner interested in learning the ropes, the following sections and 
chapters will hopefully give you enough of a jump start to be able to tackle all these tasks for 
a small project and prepare you to attempt bigger jobs. 
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6.4 Structured Query Language 


The Structured Query Language (SQL, pronounced "s q |" or "see quel") can be thought of as 
the working definition of a relational database. It provides the bioinformatics programmer with 
the wherewithal to create, populate, interrelate, query, and update a relational database on a 
computer system. Your DBMS comes with its own implementation of SQL, which will have all 
the basic commands plus some, or maybe all, the less-used commands and features in the 
standard definition of the language, perhaps even some special extensions to the standard. 


SQL dates back to the 1970s when it was developed at IBM. The most widely used versions 
of SQL are based on the standard published in 1992 and commonly called SQL2. A newer 
standard called SQL3 is available and supports emerging database functionality such as 
object-oriented and object-relational data models. MySQL is based on a subset of the most 
commonly used parts of SQL2, with the goal of providing a very fast implementation of the 
key components of SQL. Some features of SQL3 are also being added. 


SQL is actually a fairly simple language to learn. Most people find that getting an account 
established on their computer, reading through a quick tutorial, and then having example code 
to copy and modify with the SQL documentation close at hand, is enough to get started 
writing useful SQL code. 


I'm not going to present an extensive SQL tutorial here, for three reasons. First, such tutorials 
are easily and widely available. Second, each DBMS has its own version of SQL, so the DBMS 
documentation (such as that which comes with MySQL, for example) is necessary and 
available to you anyway. Third, SQL is such a basically simple language that it's quite useful to 
learn the basics of it by simply seeing a few examples. That's the approach I'll take. 


If you are new to SQL, the best way to get familiar with it is by using the interactive 
command-line interface to try out different commands. The following section demonstrates 
my Linux system running MySQL. 


6.4.1 SQL Commands 


First, I enter the interactive mysql program, providing my MySQL username ("tisdall") and 
interactively entering my MySQL account password: 

[tisdall@coltrane tisdall]$ mysql -u tisdall -p 

Enter password: 

Welcome to the MySQL monitor. Commands end with ; or \g. 

Your MySQL connection id is 2 to server version: 3.23.41 





Type 'help;' or '\h' for help. Type '\c' to clear the buffer. 
Next, I ask for a list of all the databases that are defined in my MySQL DBMS: 


mysql> show databases; 





caudyfly 
dicey 
gadfly 
master 
mysql 
poetry 
yeast 








7 rows in set (0.15 sec) 


6.4.1.1 Creating a database 
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I want to create a database called "homologs". First, I create it, then I check that it's there, 


and finally I make it the active database with use homologs;: 


mysql> create database homologs; 
Query OK, 1 row affected (0.00 sec) 


mysql> show databases; 


caudyfly 
ALely 
gadfly 
homologs 
master 
mysql 
poetry 
yeast 








8 rows in set (0.01 sec) 


mysql> use homologs; 
Database changed 


6.4.1.2 Creating tables 


The next commands create the two tables for the homologs database. Initially they are 
empty. I ask to see the fields that have been defined with show fields (show full columns 





also works): 

mysql> create table genename ( name char(20), id int, 

Query OK, 0 rows affected (0.00 sec) 

mysql> create table organism ( organism char(20), gene char(20) 


Query OK, 0 rows affected (0.00 sec) 


mysql> show tables; 


+-------------------- + 
| Tables in homologs | 
+-------------------- + 
| genename | 
| organism | 
$-------------------- - 


2 rows in set (0.00 sec) 


mysql> show fields from genename; 























4------- 4---------- 4------ 4----- 4--------- 4------- - 
| Field | Type | Null | Key | Default | Extra | 
4------- 4---------- 4------4----- 4--------- 4------- - 
| name | har (20). | YE | | NULL | | 
| id [ anti) | YE | | NULL | | 
| date | date | YE | | NULL | | 
4------- 4---------- 4------ 4----- 4+--------- 4------- - 




















4+---------- 4+---------- 4+------ 4+----- 4+--------- 4+------- 
| Field | Type | Null | Key | Default | Extra 
4---------- 4---------- 4------4+----- 4--------- 4------- 
| organism char (20) YE | NULL 

| gene | char(20) | YE | | NULL | 
4---------- 4---------- 4+------ 4+----- 4--------- 4------- 


2 rows in set (0.00 sec) 


date date 
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mysql> 
6.4.1.3 Populating the tables 


Now, I've got a new database with two tables defined, and I'm ready to populate the tables. 
First, I verify that the genename table is empty by making a select command; then I issue 
three insert commands, one for each row that I want to insert in the table. After inserting the 
rows, I verify that the genename table now has the desired three rows by means of a select 


command: 


mysql> select * from genename; 
Empty set (0.00 sec) 








Query OK, 1 row affected (0.00 sec) 











('wrinkle', 9223,'1987-08-15'); 
Query OK, 1 row affected (0.00 sec) 


mysql> insert into genename (name, id,date) 





Query OK, 1 row affected (0.01 sec) 


mysql> select * from genename; 


4--------- 4------ 4------------ + 
| name | id | date | 
4+--------- 4------ 4------------ - 
| aging | 118 | 1984-07-13 | 
| wrinkle | 9223 | 1987-08-15. | 
| hairy | 273" | LeSo=08=30° || 
foo ---H- fos n- $o----------- + 


mysql> insert into genename (name, id,date) 


mysql> insert into genename (name, id,date) 


values 


values 


values 


('aging',118,'1984-07-13'); 


("Heaiey" 2737 1 990=09=30" 3 


Now, I repeat the same process to populate the other organism table: 





mysql> show fields from organism; 

4---------- 4---------- 4------4----- 4--------- 
| Field | Type | Null | Key | Default 
4---------- 4---------- 4------4}----- 4+--------- 
| organism | char(20) | YE | | NULL 

| gene | char(20) | YE | | NULL 
+---------- +---------- +------ +----- +--------- 
2 rows in set (0.00 sec) 

mysql> insert into organism ( organism, gene 
Query OK, 1 row affected (0.00 sec) 

mysql> insert into organism ( organism, gene 
Query OK, 1 row affected (0.00 sec) 

mysql> insert into organism ( organism, gene 
Query OK, 1 row affected (0.01 sec) 

mysql> insert into organism ( organism, gene 
Query OK, 1 row affected (0.00 sec) 

mysql> insert into organism ( organism, gene 














Query OK, 1 row affected (0.00 sec) 








mysql> select * from organism; 


4---------- 4+------ - 
| organism | gene | 
+---------- +------ + 
| human | 118 | 
| human | 9223~ | 
| mouse | 9223: | 
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Let's find out which organisms have a homolog of the wrinkle gene. My query has two 
stages. First, I get the ID of the gene and search for it in the ORGANISM table. Then I write it 


as a single SQL statement: 


mysql> select id from genename where name = 'wrinkle'; 





1 row in set (0.00 sec) 


mysql> select organism from organism where gene = 9223; 


2 rows in set (0.00 sec) 





mysql> select organism from organism, 
-> where genename.name = 'wrinkle' 
+---------- + 
| organism | 
+---------- + 
| human | 
| mouse | 
+---------- + 


2 rows in set (0.00 sec) 
mysql> 


mysql> 


Notice how the last statement asks the same question as the two preceding statements 


combined. 


4 PREVIOUS 


cei 
Be 


genename 


and genename.id = organism.gene; 
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6.5 Administering Your Database 


Database administration encompasses such tasks as installing and configuring the DBMS, 
backing up the data, adding users and setting their various permissions, applying updates or 
new capabilities to the system, and similar tasks. If you just have yourself and a fairly small 
lab to deal with, it's not too bad. But organizations often hire one or more database 
administrators to do this work full time; even a smallish project, if it's critical and the budget 
exists, can benefit from the attention of a professional database administrator. 


If you are a beginning programmer and need to install and maintain a database management 
system, you'll need to read the manuals and learn the tools. Even if you have some computer 
administration experience, there is a bit of learning involved. The best thing you can do is get 
help from an experienced database administrator. 


Failing such expert help, it's necessary to get good documentation and follow it. This depends 
on the system you're using, of course. The following sections describe some of the basics. 


6.5.1 Adding Users 


One function a database administrator needs to know is how to add users and set their 
permissions, that is, what operations they're allowed to perform, and what resources they're 
allowed to view or change. In MySQL, for example, each user needs an account name and a 
password for access (these aren't tied to the rest of the account names and passwords on 
the computer system). Security can be important as well. You may use the database to 
manage your new data and results, which you don't want to release to the public just yet; at 
the same time, you may be providing the public, through a web site, access to your more 
established data and results. A system such as MySQL provides several tools to set up and 
manage accounts and security permissions. 


6.5.2 Backup and Reloading 


One essential task for any computer system's effort, including working with databases, is to 
back up your work. All computers will break; every disk drive will crash and become 
inoperable. If you don't have timely backups of your data, you will certainly lose it. 


There are many ways to back up data; even MySQL has more than one method. However, 
even if you back up your data from the database to a backup file, it's still necessary to make 
a copy of the backup file in another location that's not on the same hard disk. For the small 
to medium project, it's possible to run a program that simply makes a text file containing 
MySQL commands that repopulates your database. This is often a convenient and workable 
method, but check the MySQL documentation if you wish for alternatives. 


Here, then, is how you can make a backup, or dump, of a database, in this case to the disk 
file homologs.dump: 


[tisdall@coltrane development]$ mysqldump homologs -u tisdall -p > 
homologs.dump 

Enter password: 

[tisdall@coltrane development] $ 





After that command, a dump file called homologs.dump is created. Here's what it looks like for 
my little two-table database: 


[tisdall@coltrane development]$ cat homologs.dump 
# MySQL dump 8.14 


# 

# Host: localhost Database: homologs 

# ee ee a 
# Server version S241 
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# 
# Table structure for table 'genename' 


# 








CREATE TABLE genename ( 
name char(20) default NULL, 
id int(11) default NULL, 
date date default NULL 
) TYPE=MyISAM; 



































# 

# Dumping data for table 'genename' 

# 

INSERT INTO genename VALUES ('aging',118,'1984-07-13'); 
INSERT INTO genename VALUE ('wrinkle', 9223,'1987-08-15'); 
INSERT INTO genename VALUE (*hairy*,273,.'1990=09=30" ) ; 

# 

# Table structure for table 'organism' 

# 














CREATE TABLE organism ( 
organism char(20) default NULL, 
gene char(20) default NULL 

) TYPE=MyISAM; 





























- 

# Dumping data for table 'organism' 

- 

INSERT INTO organism VALUES ('human','118'); 
INSERT INTO organism VALUES ('human','9223'); 
INSERT INTO organism VALUES ('mouse!,'9223'); 
INSERT INTO organism VALUES ('mouse!,'273'); 
INSERT INTO organism VALUES ('worm','118'); 
[tisdall@coltrane development] $ 








Once you've backed up your data, you can drop the database (wiping out the data) and then 
create the (empty) database again. You can then use your dump file to redefine the tables 
and repopulate them with your saved rows of data. 


Here's how you might delete a database and reload it from your dump backup file 
homologs.dump: 


mysql> drop database homologs; 
Query OK, 0 rows affected (0.00 sec) 


mysql> show databases; 


caudyfly 
ALCEY 
gadfly 
master 
mysql 
poetry 
yeast 








7 rows in set (0.00 sec) 


mysql> create database homologs; 
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Query OK, 1 row affected 


mysql> use homologs; 
Database changed 


(0.00 sec) 


mysql> source homologs.dump; 


Query OK, 
Query OK, 1 row affected 


Query OK, 1 row affected 





Query OK, 1 row affected 


Query OK, 0 











Query OK, 1 row affected 
Query OK, 1 row affected 
Query OK, 1 row affected 
Query OK, 1 row affected 
Query OK, 1 row affected 
mysql> show tables; 
4-------------------- - 

| Tables _in_ homologs | 
4-------------------- - 

| genename | 

| organism | 
+-------------------- + 

2 rows in set (0.00 sec) 


O rows affected 


rows affected 


(Q..00" See) 
(0:.00-see) 
(0.00 sec) 
(0.00 sec) 

(0i.00' see) 
(0.00 sec) 
-00 sec) 
-00 sec) 
.00 


sec) 


-00 sec) 


mysql> select * from genename; 


4--------- 4------ 4------------ + 
| name | id | date | 
4--------- 4------ 4------------ + 
| aging | 118 | 1984-07-13 | 
| wrinkle | 9223 | 1987-08-15 | 
| hairy |) 272° |), 2990=09=30- | 
$oos----n- fossa n- foo ------ + 


3 rows in se 


mysql> selec 











t (0.00 sec) 


t * from organism; 
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6.6 Relational Database Design 


Database design is the process of effectively organizing data into tables in a relational 
database. When thinking about how to organize a database you need to ask, "What fields 
should I put together into tables, and how should I interrelate the tables?" In this section I will 
show you a short example that demonstrates some common and useful techniques for doing 
just that. 


Relational database software projects are best broken down into separate stages. This list 
from Database Systems: a Practical Approach to Design Implementation, and Management 
(see Section 6.10), shows the typical stages of database design and construction: 


Database planning 

System definition 
Requirements collection and analysis 
Database design 

DBMS selection 

Application design 
Prototyping 

Implementation 

Data conversion and loading 
Testing 

Operational maintenance 


For the small biology lab in which database programming may be a one-person project, some 
of these stages may be brief and informal, but they still apply. 


How should the tables be defined for a new database? The answer depends on the problems 
to be answered by the data, but it's also largely a matter of common sense and a feel for the 
data. The database beginner typically looks at the data, tries her hand at a few designs, and 
begins to get a sense of how tables can be used for a specific problem. Let yourself 
experiment and try a few alternatives, and you'll soon get the hang of it. 


Tables are commonly interrelated by indexing and by joining fields from different tables. The 
SQL language implemented with your DBMS provides these abilities. Also, a group of 
techniques called normalization can help you produce a good design and avoid some 
problems. Simply putting data into tables doesn't guarantee a good design. 


A set of rules called normal forms helps you arrange the data into tables in a way that avoids 
certain problems. One such problem is data redundancy, which is unnecessary duplication of 
data in different tables. A related problem is update anomalies caused by having the same 
data in more than one location. When such data is updated, copies may not be updated 
properly, and the database can become inaccurate. 


Here are some simple rules to follow when designing your database: 
e Each entry of each table has a single value. This is first normal form. 


e Each table has a a unique identifier (called a primary key) for each row. This is second 
normal form. 


e Names aren't used as identifiers because they can lead to data redundancy. 


Third and other normal forms as well as other design considerations aren't covered in this 
book due to space limitations. See Section 6.10 at the end of the book for more information 
about relational database design. Consider Table 6-3, which shows this alternate, 
unnormalized version of my homologs database. 
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Table 6-3. Unnormalized homolog data 


hairy 1990-09-30 mouse 











This table has multiple values in some of the locations in the table. To put the table into first 
normal form, I make new rows for each multiple entry (see Table 6-4). 


Table 6-4. First normal form homolog data 





wrinkle 1987-08-15 
wrinkle 1987-08-15 





In relational databases, a functional dependency exists between fields for which the value of 
one field is always associated with no more than one value in a second field. In Table 6-4, the 
value "wrinkle" in the Name field is associated with Organism "mouse" in one row, and with 
Organism "human" in another row. This isn't a functional dependency. On the other hand, the 
value "wrinkle" in the Name field is associated only with the value "1987-08-15" in the Date 
field. This is a functional dependency. In a real database, you need to ascertain that the test 
for a functional dependency will hold even as the database is updated; it requires knowledge 
of the use of the database and the possible range of values of the fields. In Table 6-4, the 
functional dependencies are: 

Name => Date 

Date => Name 


The primary key is a minimal group of fields that uniquely identifies each row. In Table 6-4, I 
can choose the pair of fields Organism/Name or Organism/Date as a primary key. 
Remember, a field is fully functionally dependent on a primary key if removing any field from 
the primary key destroys the functional dependency. 


Practically speaking, you can ensure second normal form by making sure each table has a 
field that is a unique identifier: usually this field has integer values such as 1, 2, 3, etc; no row 
is missing an identifier integer, and no two rows have the same integer. More formally, a 
relation in first normal form for which every non-primary-key field is fully functionally 
dependent on the primary key, is in second normal form. The rule-of-thumb solution is to 
divvy up the fields into new tables, and possibly add some new unique identifier keys. But how 
exactly do you do that? 


Looking over the data, I notice that Gene and Date are always associated with the same 
values (say they represent a gene name and the date it was first reported). I can make a 
new table Genes (Table 6-5) out of them, with Gene as the primary key. I can then make a 
second table Organism (Table 6-6) from the Organism field, adding an ID field OrgID. Finally, I 
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can make a third table Variants (Table 6-7) with its own field VarID unique for each row, an 
OrgID field, and a GeneID field. Table 6-7 contains a row for each gene in each organism in 
the database. Table 6-5 through Table 6-7 show my new design for the homologs database 
(in second normal form). 


Table 6-5. homologs database design in second normal form: Genes 


Table 6-6. homologs database design in second normal form: 

















Organism 
OrgID 
1 human 
2 worm 
3 mouse 








Table 6-7. homologs database design in second normal form: 


Variants 
pe 
2 2 
2 Le 
, Cs 
: En 








Each table contains a field that isn't null, that is unique for each row, and that I can use as a 
primary key. Such keys are called unique identifiers. These are Gene, OrgID, and VarID, in 
Table 6-5, Table 6-6, and Table 6-7, respectively. Notice that in each table the other fields are 
functionally dependent on the unique identifier row. And, finally, it's clear that the definition of 
"fully functionally dependent," which involves removing fields from a primary key, is satisfied 
when the primary key has only one field. 





My tables are now in second normal form and are much improved. You'll notice, however, 
that there is still a problem of data redundancy, which can lead to update anomalies. If, for 
instance, the name of the "aging" gene was changed to "fountain of youth," three changes 
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would have to be made to this database to keep all the data in sync. Using names as unique 
identifiers is a problem. Sometimes it works, but you must consider the operation of your 
database over time. In this case, genetic nomenclature may (and does) change. If I use a 
numeric ID as a unique identifier for a gene name, I can handle a gene name change by 
making a single update in only one table in my database (see Table 6-8, Table 6-9, and Table 
6-10). 


Table 6-8. homologs database design: Genes 


Table 6-9. homologs database design: Organism 














OrgID 

1 

2 

3 

Table 6-10. homologs database design: Variants 

if 1 118 

2 2 118 

: : 
, 
5 3 273 








Here's an example from my Linux system in which I drop the previous definition of the 
homologs database in my MySQL RDMS and redefine it with the three tables as just shown. I 
just define the tables here, but I don't populate them, as that will be done in Section 6.7.2: 


[tisdall@coltrane tisdall]$ mysql -u tisdall -p 

Enter password: 

Welcome to the MySQL monitor. Commands end with ; or ,,. 
Your MySQL connection id is 9 to server version: 3.23.41 








Type 'help;' or '\h' for help. Type '\c' to clear the buffer. 
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mysql&gt; drop database homologs; 
Query OK, 9 rows affected (0.13 sec) 


mysqlé&gt; show databases; 


caudyfly 
dicty 
gadfly 
yea. 
master 
mysql 
poetry 
rebase 
Pest 











10 rows in set (0.12 sec) 


mysql&gt; create database homologs; 
Query OK, 1 row affected (0.00 sec) 


mysql&gt; use homologs; 
Database changed 
mysql&gt; CREATE TABLE organism ( 

















) TYPE=MyISAM; 
Query OK, 0 rows affected (0.00 sec) 

















mysql&gt; CREATE TABLE genes ( 


geneid int(11) default NULL, 
gene char(20) default NULL, 





date date default NULL 
) TYPE=MyISAM; 
Query OK, 0 rows affected (0.01 sec) 











mysql&égt; CREATE TABLE variants ( 
varid int(11) default NUL 








orgid int(11) default NULL, 
organism char(20) default NULL 


A 








orgid int(11) default NUL 


at 4 








geneid int(11) default NULL 





) TYPE=MyISAM; 
Query OK, 0 rows affected (0.00 sec) 


mysql&gt; show tables; 


$-------------------- + 
| Tables _in_ homologs | 
+-------------------- + 
| genes | 
| organism | 
| variants | 
+-------------------- + 


mysqlé&gt; 


I have only briefly introduced some of the techniques useful in the design of a small database. 
There is, of course, training and skill involved in database design, as well as certain standard 
(and occasionally competing) methodologies; far more than I have the space to introduce 


here. 
| «PREVIOUS | 
[ 
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6.7 Perl DBI and DBD Interface Modules 


SQL is a fairly simple and easy-to-learn language, considered by most to be well-tailored to its 
task. However, there are things available in most programming languages, such as control 
flow (while, for, foreach) and conditional branches (if-else) that aren't provided in most 
implementations of the language. The lack of these abilities severely restricts the use of SQL 
as a standalone language. 


Most applications that use a relational database are written in another language such as Perl. 
Perl provides a link between the application, the programmer, the user, the files, the web 
server, and so on. Perl also provides the program logic. The interaction between the 
application and the database is typically to execute some database commands, such as 
fetching data from the database and processing it using Perl's capabilities. The logic of the 
program may depend on the data found in the database, but it is Perl, not SQL, that provides 
this logic (for the most part). 


In Perl, a set of modules have been written that allow interaction with relational databases. 
The DataBase Independent (DBI) module handles most of the interaction from the program 
code; the DataBase Dependent (or DataBase Driver) (DBD) modules, different for each 
particular DBMS, handle communicating with the DBMS. 


6.7.1 Installing and Configuring Perl DBI and DBD Modules 


To use a MySQL database from Perl, you need first to have installed and properly configured 
MySQL. This is not a Perl job, but a database administration job; you have to get MySQL and 
install it on your system and set up the appropriate user accounts and permissions. 


You have to then install the Perl DBI module ( 
http://www.symbolstone.org/technology/perl/DBI) using CPAN from the command line: 


perl -MCPAN -e shell; 





Then type: 
install DBI 


You can also install it by downloading the module from CPAN, for example, via a web browser 
and following the QUICK START GUIDE instructions shown here: 


QUICK. START GUIDES 








The DBI requires one or more 'driver' modules to talk to databases. 


Check that a DBD::* module exists for the database you wish to use. 























Read the DBI README then Build/test/install the DBI by doing 
perl Makefile.PL 
make 
make test 
make install 

Then delete the source directory tree since it's no longer needed. 








Use the 'perldoc DBI' command to read the DBI documentation. 


Fetch the DBD::* driver module you wish to use and unpack it. 
http://search.cpan.org/ (or www.activestate.com if on Windows) 
It is often important to read the driver README file carefully. 
Generally the build/test/install/delete sequence is the same 
as for the DBI module. 


























The DBI.pm file contains the DBI specification and other documentation. 
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PLEASE READ IT. It'll save you asking questions on the mailing list 
which you will be told are already answered in the documentation. 





For more information and to keep informed about progress you can join 
the a mailing list via mailto:dbi-users-help@perl.org 





To help you make the best use of the dbi-users mailing list, 
and any other lists or forums you may use, I strongly 
recommend that you read "How To Ask Questions The Smart Way" 
by Eric Raymond: 








http://www. tuxedo.org/~esr/faqs/smart-questions.html 


Much useful information and online archives of the mailing lists can be 
found at http://dbi.perl.org/ 


See also http://search.cpan.org/ 


Finally, you have to install the Perl DBD driver for MySQL, called DBD: :MySQL. Look in CPAN at 
http ://cpan.org/modules/by-module/DBD/ for the latest version; at the time of writing, it's 
http://cpan.org/modules/by-module/DBD/DBD-mysgql-2.1026.tar.gz. 








The combination of MySQL (the DBMS), DBD (the particular driver for your DBMS), and DBI 
(the Perl interface to the DBI and DBMS), is what gives the actual connection from Perl to the 
database and enables you to send SQL statements to the database and retrieve results. 


Getting these components installed is sometimes the most difficult part of getting involved 
with database programming. Installing and configuring MySQL has several steps, and if you 
are very new to computers, you may find some of the instructions difficult to follow, as they 
may assume that you know more about your computer system than you do. DBI and DBD 
are typically much easier to install, but you may run into snags with them as well. The help of 
experienced hands, either directly or by means of the type of mailing list mentioned in the 
QUICK START GUIDE, can make the difference between days of frustration and a successful 
installation. 


6.7.2 Handling Tab-Delimited Input Files 





Let's say you have the components installed (MySQL, Perl, DBD, DBI), and you want to write 
a program that talks to the database. I'll assume you've implemented a new version of the 
homologs database as shown. We'll now walk through a small Perl example that shows how 
to read data in from a file, populate a database, send queries, and retrieve results. 


First, here is the data as you might find it in a file. All the whitespace between the words is the 
tab character in the file, not space characters: 
































TABLE ORGANISM 

OrgId Organism 

1 human 

2 worm 

2 mouse 

TABLE GENES 

Geneld Gene Date 

118 aging 1984-07-13 
O223 wrinkle 1987-08-15 
293 hairy 1990-09-30 
TABLE VARIANTS 

Varid Orglid Geneld 

1 1 118 

2 2 118 

3 1 9223 

4 3 9223 
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2 3 273 


There are several ways to find the data in a database. It's common to have it in a plain file 
that has tables represented by lines, one table row on each line, with the field values 
separated by tabs or some other character that doesn't appear in any of the values of any 
field. 


You should bear in mind that this is just one of several possibilities for the source and format 
of your input data. See the interesting book Data Munging with Perl, by David Cross 
(Manning) for lots of useful lore about getting data in and out of various sources. 


SQL itself provides a utility for this purpose, called load, which assumes that you have a file 
consisting of only rows of data. You can specify what columns to load, what delimiter the file 
uses (tab by default), and a few other options. Its performance is optimized, and it is much 
faster than executing several SQL insert statements. However, you still need to read data in 
from files in different formats: what better than your own program that you can alter to suit 
any occasion? 


To start developing such a utility, here is a short program to populate the database. It reads 
the file and knows the table it's reading, the field names, and the data for each row: 


#!/usr/bin/perl 


use strict; 
use warnings; 


my Sflag = 0; 
my Stable; 

my @table; 

my @fieldnames; 
my @fields; 


while(<>) { 
LEC sNS*S/) -f 
# skip blank lines 


}elsif (/*TABLE\t(\wt)/) { 
output previous table 
print (@table) if S$flag; 














Sflag = 1; 
begin new table 
@table = ( ); 


Stable = $17 
push(@table, "\nTable is S$table\n"); 


} elsif ($flag = = 1) { 

@fieldnames = split; 

Sflag = 2; 

push(@table, "Fields are ", join("|", @fieldnames), "\n"); 
} elsif ($flag = = 2) { 

@fields = split; 

push(@table, join("|", @fields) . "\n"); 


} 


# output last table 
print @table; 


This program understands the file format I gave previously, reads it in, and then reformats it 
and prints it out. It's just an example of how you might read in data. In the following, I'll 

modify this program to read in the file, but instead of printing out the (reformatted) tables, it 
sends SQL commands to the MySQL database to insert the data into the appropriate tables. 


As you see, this first version of the program uses the $flag variable to keep track of what 
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it's reading. Every time the input line begins with TABLE\t (that \t is a tab that actually shows 
up as whitespace), the program outputs the previously read table (if $flag indicates there 
was one). It then saves the next word as the table's name, sets the $flag to 1, and 
prepares some output in the array @table. 


Otherwise, if the $f£lag variable is set to 1, the program knows it's on the second line of a 
table (remember, this program is specially written for the input file format I gave previously). 
In this case, it saves the names of the fields in an array, and then reformats them and adds 
them to the @table output array. 


Finally, if the $flag variable is set to 2, the program knows it's reading rows of the table; it 
reformats them and adds them to the @table output array. 


When all the input is done, and the while loop finishes, there will be the last table's 
reformatted output ready to be printed from the @table output array. 


If I call this program homologs.getdata and give it my data file homologs.tabs like so: 


° 


% perl homologs.getdata homologs.tabs 


I get the following output: 


Table is ORGANISM 

Fields are OrgId|Organism 
1|human 

2|worm 

3|mouse 





Table is GENES 

Fields are GenelId|Gene|Dat 
118 |aging|1984-07-13 
9223|wrinkle|1987-08-15 
273 |hairy|1990-09-30 














Table is VARIANTS 

Fields are VarId|OrgId|GenelId 
1/1/118 

2|2|118 

31/1|9223 

4|3|9223 

o(3 (273 


Notice that all I've really done here is read in the data and print it out in a slightly different 
format; among other things, I've changed the delimiter between fields from a tab to a vertical 
bar, a common type of task with these database dumps. But now, let's see how to interact 
with an actual database. 


6.7.3 DBI Examples 


Let's take the homologs.getdata program from the previous section and add the DBI calls to 
the MySQL database that will populate the MySQL database with the read-in data. 


6.7.3.1 homologs.tabs 


For starters, let's just see a Perl program that connects to the database, asks a simple 
question ("What tables are in this database?"), displays the results, and disconnects: 


#!/usr/bin/perl 


use strict; 
use warnings; 


# Make connection with MySQL database 
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use DBI; 


my Sdatabase = 'homologs'; 

my Sserver = 'localhost'; 

my Suser = 'tisdall'; 

my Spasswd = 'NOTmyPASSWORD'; 

my Shomologs = DBI->connect ("dbi:mysql:Sdatabase:S$server", Suser, Spasswd); 


# prepare an SQL statement 


my Squery = "show tables"; 
my Ssql = Shomologs->prepare (Squery) ; 


# execute an SQL statement 

Ssql->execute( ); 

# retrieve and print results 

while (my $row = $sql->fetchrow_arrayref) { 


print join("\t", @$row), "\a") 


} 





# Break connection with MySQL database 
Shomologs->disconnect; 


exit; 
Here's the result of running this program: 


GENES 
ORGANISM 
VARIANTS 














This program does the basic tasks that all DBI programs have to do, so let's examine them in 
useful detail. 


After the obligatory use DBI; that loads the DBI module, I declare some variables to hold the 
string that specifies to DBI who is connecting to what and where. The actual connection 
happens here: 


my Shomologs = DBI->connect ("dbi:mysql:Sdatabase:S$server", Suser, Spasswd) ; 


The connect method is asked to connect to the particular database (in this case the 
homologs database) on the local computer (localhost; this can be replaced by a URL of 
another computer) as the user with the given password. mysql is specified; if you are using 
another DBMS, you should change this to reflect your database system. Other optional 
arguments, not used here, are possible (see the documentation). The connect method is the 
DBI new method that creates (and initializes) a DBI object. 


The output of the connect call is saved as a new DBI object in the reference variable 
Shomologs. 


Next, the DBI object prepares an SQL statement, and the result is saved as a new statement 
object, which I call $sql here. This statement object $sql then calls its own execute method 
which actually does the job of sending the SQL to the database. 


The actual SQL here is a very simple one. You've already seen that DBI->connect specifies 
the particular database on your MySQL (homologs in this case). This SQL statement show 
tables simply asks for a list of the names of the tables that are defined in that database. 


After execute, the program retrieves the results of executing the SQL statement. There are 
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several ways to retrieve results. Here, the statement object method fetchrow_ arrayref is 
called in a loop to fetch all the rows of the result; at each pass through the loop, the return 

value in $row points to the array of fields in that row. Here, I simply separate the fields with 

tab characters with the help of the Perl join function on the dereferenced array @$row and 

print the row with a newline. 


The last DBI call is to disconnect from the database. This is actually an important call to 
make; depending on your implementation and how you're running your program, it is 
sometimes possible to open a number of connections and eventually tax the MySQL DBMS to 
the point where it has to refuse any more connect requests. Especially if you have an active 
database with regular queries coming in, you want to disconnect as soon as possible from 
each connect 


The program you've just seen is the kind of sample program you should run when you first 
try to install and use the DBI module. If it works, you're in business. If not, you have to 
closely examine the error messages you get to identify where the problem is. One thing that 
can help is to add an additional argument to the connect call that asks for more error 
reporting, like so: 
my Shomologs = DBI->connect ( 

"dbi:mysql:S$database:Sserver", Suser, Spasswd, {RaiseError=>1} 





)3 


This terminates the program with extra error messages if the connect call fails; this is usually 
where things go wrong when you first try to use this software. (See the documentation for 
other such options to connect.) Remember that you need a username and password for 
MySQL itself, and that these aren't related in any way to the username and password on your 
computer outside of MySQL. You also need to have the database defined (e.g., the SQL 
statement create database life creates a database called life), and you have to have 
your MySQL permissions set properly to allow you to do the things you want to do, such as 
create or modify databases, or see other databases on the system. If there's a problem, ask 
your database administrator, consult your MySQL documentation, or visit the MySQL web 
site. 


6.7.3.2 homologs.load 


This next program does a little more than the previous; it reads in the tab-delimited file 
homologs.tabs, extracts the data, and uses it to load the tables in a MySQL database. Read 
it over: most of it will be familiar from previous programs in this chapter. 


!/usr/bin/perl 


use strict; 
use warnings; 


Make connection with MySQL database 





use DBI; 
my Sdatabase = 'homologs'; 
my Sserver = 'localhost'; 
my Suser = 'tisdall'; 
my Spasswd = 'NOTmyPASSWORD'; 
my Shomologs = DBI->connect ("dbi:mysql:Sdatabase:S$server", Suser, Spasswd); 
my Ssqlinit = Shomologs->prepare("show tables"); 
Ssqlinit->execute( ); 
while (my $row = $sqlinit->fetchrow_arrayref) { 
Print Foin(™\i", @Srow); “\a"> 


} 


my Sflag = 0; 
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my Stable; 
my @tables; 
my Ssql; 


while(<>) { 
# skip blank lines 
LE(/ N\S*8/) 4 


next; 


# begin new table 
}elsif (/*TABLE\t(\wt)/) { 
Sflag = 1; 
Stable = $1; 
push(@tables, Stable); 
# Delete all rows in database table 
my Sdroprows = Shomologs->prepare ("delete from Stable"); 
Sdroprows->execute( ); 

















# get fieldnames, prepare SQL statement 


} elsif ($flag = = 1) { 

Sflag = 2; 

my @fieldnames = split; 

my Squery = "insert into Stable (" 
join(",", @fieldnames) 
") values (" 
"2, " x (@fieldnames-1) 
ea kd 


Ssql = Shomologs->prepare ($query) ; 


# get row, execute SQL statement 
} elsif ($flag = = 2) { 
my @fields = split; 
Ssql->execute( @fields); 





} 


# Check if tables were updated 


foreach my Stable (@tables) { 
my Squery = "select * from Stable"; 
my Ssql = Shomologs->prepare (Squery) ; 
Ssql->execute( ); 


while (my $row = $sql->fetchrow_arrayref) { 
print join("\t", @$row), "\n"; 


} 


# Break connection with MySQL database 
Shomologs->disconnect; 


exit; 
This program is called by giving it the name of the tab-delimited file on the command line (the 
same file used previously with the homologs.getdata program): 


% perl homologs.load homologs.tabs 


This is the output of the program: 


GENES 
ORGANISM 
VARIANTS 
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Table: ORGANISM 


1 human 
2 worm 
3 mouse 





Table: GENES 











118 aging 19384=07=13 
9223 wrinkle 1987-08-15 
273 hairy 1990-09-30 


Table: VARIANTS 


1 1 118 
2 2 118 
2 1 O72 3 
4 3 9223 
2 3 2S 


This homologs.load program is very much like the previous homologs.getdata program. 
However, instead of building an output array @tables and printing the text to the screen, 
homologs.load puts the data into the MySQL database using SQL statements. When the 
database is loaded, it retrieves the data from the tables and prints it to the screen. 


Notice that each time homologs. load finds a new table, it first empties all rows of that table 
in the database and then proceeds to read in the lines of data and insert new rows into the 
table. 


Notice also that when the program reads the line of the input file that names the fields (when 
$flag equals 1), it prepares the SQL statement, saving the object in the $sql variable. Then, 
when the actual lines of data are read (when $flag equals 2), the values of the fields are 
passed to the execute command, which sends the SQL command by the $sql object to the 
database system. 


6.7.3.3 An SQL query 


The SQL statement that is the argument to the prepare method is built up from information 
that the program knows at that point. Here it is again: 
my Squery = "insert into Stable (" 

join(",", @fieldnames) 

") values (" 


"2, " x (@fieldnames-1) 
bas ae 


This is a bit hard to read at first sight. However, it is typical of what happens when you use 
one language (Perl) to make a statement in another language (SQL). So I'll explain this one 
carefully as a good example of the breed. 


The SQL query is formed by five strings that are joined by the dot (.) string operatorrecall 
that "r" . "DNA" has the value "rDNA". Here are the five strings being joined: 


"insert into Stable (" 
join(",", @fieldnames) 
") values (" 

"2, " x (@fieldnames-1) 
mm) W 


The question marks in the SQL statement are bind variables that are passed the values from 
the @fields array when the execute statement is called with: 


Ssql->execute( @fields ); 





If the query is called with: 
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Stable = EXONS 





and: 





@fieldnames = (Exon, Position) 


the resulting SQL statement is: 


"insert into EXONS (Exon,Position) values (?, ?)" 








How does this statement get constructed? Let's look at it in detail: 
e The first string just interpolates the table name EXONS into the string. 
e The second string joins the field names with commas. 
e The third string appears as is. 


e The fourth string uses the Perl x string operator to make a new string that has a 
certain number of copies of the original string "?, ". The desired number of copies is 
specified on the right side of the x operator. The string itself is on the left side of the x 
operator, "?, ". 


@fieldnames in a scalar context returns the number of elements of the @fieldnames 
array; I need one less '"?, " than that plus an additional question mark without a 
comma (because no commas are allowed after the last item in the list in SQL). So, 
@fieldnames-1 is the desired number of copies of the string "?, ". 


e The fifth string "?)" is just the last question mark and the closing parenthesis. 


The final result is: 








insert into EXONS (Exon, Position) values (?, ?) 


As the Perl program reads in the data rows from the input file (when $flag equals 2), the 
values are placed in the array @fields and then passed as variables (to take the place of the 
question marks in the SQL statement that's been prepared) to the execute method, which 
sends the SQL statement to the database system. These question marks are the bind 
variables; here, I need one for each field, because I'll pass in the field values when execute is 
eventually called on this statement. 


To check what actually happened to the database after the reading in and processing of the 
file is complete, the program sends an SQL query for each table to see what's in it. This is 
done with the SQL select commanda general-purpose command to get information out of a 
database that has a great many options. 


Here, I'm asking to see all fields (*) from the database table, and because no restrictions are 
added, SQL shows us all the fields of all the rows. As before, the result of this SQL query is 
read using the fetchrow_arrayref DBI method in a while loop, and each resulting row is 
printed with tab-separated fields. 


This last program homologs.1load is a typical DBI program, interacting with the world through 
Perl (e.g., reading in files and displaying the results to the user) and also interacting with the 
database through the Perl DBI module and SQL statements. 


i 
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6.8 A Rebase Database Implementation 


In earlier chapters, I developed an object-oriented interface to Rebase that stores the data in 
a simple DBM hash database, and in this chapter I showed how to interact with a MySQL 
relational database management system. 


Now, let's put the two things together. 


In this section, I'll take the Rebase.pm module that implements the Rebase class and modify it 
to use a relational database instead of Perl's simple hash-based DBM database. I'll call the 
resulting class RebaseDB. 


For a small problem like this, either approach works pretty well. In fact, on my computer, the 
DBM approach is considerably faster than the MySQL version. 


The relational database implementation has a slower response due to the more complicated 
DBMS system involved and increased overhead in programming because a greater number of 
programming statements must be written. However, the reason for using it is probably clear: 
the relational database provides a lot more flexibility in making queries, organizing the data, 
and, especially, expanding the application to handle a greater variety of data. 


This flexibility is very important. The Rebase database from http://www.neb.com/rebase 
includes several more datafiles, such as where to get the enzymes. It wouldn't be difficult to 
add a table or tables to store that information in the relational version of my Perl program; it 
would be a major pain to keep adding new DBM hashes for various complex relationships. So, 
I want a relational database because it has good scalability as the database application 
grows. 





6.8.1 RebaseDB Class: Accessing Restriction Enzyme Data 


Following is the code for a new class, RebaseDB, which aims to provide the same functionality 
as the previous class Rebase, but with a relational database instead of a DBM hash data 
storage. 


In the interest of space, the code for the following subroutines isn't reproduced here; look for 
them in Chapter 5: revcomIUB, complement IUB, and IUB_to_regexp. They are, of course, 
included in the RebaseDB.pm module available for downloading from this book's web page. 








package RebaseDB; 


L 

# A simple class to provide access to restriction enzyme data from Rebase 
including regular expression translations of recognition sites 

Data is stored in a MySQL database 


use strict; 
use warnings; 
use DBI; 

use Carp? 








Class data and methods 


# A hash of all attributes with default values 
my % attributes = ( 


_rebase => { }, # unused in this implementation 
# key = restriction enzyme name 
# value = pairs of sites and regular 
expressions 
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i 


_mysql => '??', # e.g. mysql => 'rebase:localhost', 
dbh ae # database handle from DBI->connect 





_bionetfile => '??', # source of data from e.g. “bionet.212" file 


# Return a list of all attributes 
sub _all attributes { 


} 
} 


fo) 


keys % attributes; 


# The constructor method 
# Called from class, e.g. $obj = Rebase->new( mysql => 'localhost:rebase' 
sub new { 


my 





(Sclass, sarg) = @ ; 


# Create a new object 


my 





Sself = bless { }, $class; 


# Set the attributes for the provided arguments 








foreach my Sattribute ($self->_all_attributes( )) { 
+ Esgs attribute = " name", argument = "name" 
my (Sargument) = (Sattribute =~ /* _(.*)/); 


} 





if (exists Sarg{Sargument}) { 
if(Sargument eq 'rebase') { 
croak "Cannot set attribute rebase"; 





} 
Sself->{Sattribute} = Sarg{Sargument}; 





# MySQL host:database string must be given as "mysql" argument 
unless(Sarg{mysql}) { 


} 


croak("No MySQL host:database specified"); 


# Connect to the Rebase database 


my 
my 
my 


Suser = 'tisdall'; 
Spasswd = 'NOTmyPASSWORD'; 
Sdbh; 


unless(S$dbh = DBI->connect ("dbi:mysql:Sarg{mysql}", Suser, Spasswd) ) 


} 





carp "Cannot connect to MySQL database at Sarg{dbmfile}"; 
return; 


Sself->setDBhandle ($dbh) ; 





{ 


de 


# If "bionetfile" argument given, populate the database from the bionet 


file 
if 


LS 


# For 


(Sarg{bionetfile}) { 


Sself->parse rebase( ); 


turn Sself; 





this simple class I have no AUTOLOAD or DESTROY 





# No "set" mutators: all initialization done by way of "new" constructor 


sub ge 





t_ regular expressions { 


my ($self, Senzyme) = @ ; 
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sub 


sub 


sub 


sub 


sub 


my Sdbh = Sself->getDBhandle; 

my Ssth = Sdbh->prepare ( 

"select Regex from REGEXES, ENZYMES where 
ENZYMES.EnzId = REGEXES.EnzId and ENZYMES.Enzyme=?' 






























































); 
Ssth->execute (Senzyme) ; 








my @regexes; 

while( my $row = $sth->fetchrow_arrayref) { 
push(@regexes, S$Srow[0]); 

} 


return @regexes; 


getDBhandle { 
my($self) = @ ; 


return S$self->{_dbh}; 


setDBhandle { 
my(S$self, Sdbh) = @ ; 


vs 





return $self->{ dbh} = $dbh; 





get_recognition sites { 
my(S$self, Senzyme) = @ ; 





my Sdbh = Sself->getDBhandle; 
my Ssth = Sdbh->prepare ( 
"select Site from SITES, ENZYMES 
where ENZYMES.EnzId = SITES.EnzId and ENZYMES.Enzyme=?' 



























































); 
Ssth->execute (Senzyme) ; 








my @sites; 

while( my $row = $sth->fetchrow_arrayref) { 
push(@sites, S$Srow[0]); 

} 


return @sites; 


get_bionetfile { 
my($self) = @ ; 





return Sself->{_bionetfile}; 


parse rebase { 
my(Sself) = @ ; 


handles multiple definition lines for an enzyme name 
also handles alternat nzyme names on a line 











Get database handle 
my Sdbh = Sself->getDBhandle( ); 


ables, recreate them 
handles with "bind" variables and autoincrement 


Delete existing 
Prepare statement 























ENZYMES table 
my Sdrop = Sdbh->prepare('drop table if exists ENZYMES"); 
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Sdrop->execute ( 


Ve 


my Screate = Sdbh->prepare ( 





"CREATE TABLE 




















ENZYMES ( EnzId int(11) NOT NULL auto_increment default 





TO"; 


TYPE=MyISAM" 
i 
Screate->execute ( 




















Enzyme varchar(255) NOT NULL default '', PRIMARY KEY (Enzi) ) 





i 


# Prepare filehandles outside of "while" loop 








my Senzymes_ selec 
"select EnziId 
); 


t = Sdbh->prepare ( 
from ENZYMES where Enzyme=?' 

















t = Sdbh->prepare ( 





my Senzymes_ inser 








‘insert ENZYM 





ES ( EnziId, Enzyme ) values ( NULL, ? )' 











\F 


# SITES table 


Sdrop = S$dbh->prepare('drop table if exists SITES'); 


Sdrop->execute ( 
S$create = $dbh->p 


)e 
repare ( 








"CREATE TABLE 

















1O%;, 
EnzId int(1l 
default '', 

PRIMARY KEY 
3 
Screate->execute ( 











SITES ( SiteId int(11) NOT NULL auto_increment default 
) NOT NULL default '0', Site varchar(255) NOT NULL 


(SiteId)) TYPE=MyISAM" 





i 


# Prepare filehandles outside of "while" loop 





‘insert SITES 





"select Enzid 








my Ssitesrevcom_s 
"select Enzid 











# REGEXES table 











my Ssites insert = $dbh->prepare ( 


( SiteId, EnzId, Site ) values ( NULL, ?, ? )' 





my Ssites_ select = $dbh->prepare ( 


, Site from SITES where EnzId=? and Site=?' 


elect = Sdbh->prepare ( 
, Site from SITES where EnzId=? and Site=?' 

















Sdrop = S$dbh->prepare('drop table if exists REGEXES'); 


Sdrop->execute ( 
Screate = S$dbh->p 





); 


repare ( 








"CREATE TABLE 




















default 'O', 

EnzId int(11 
default '', 

PRIMARY KEY 
\; 
Screate->execute ( 
# Prepare filehan 











REGEXES ( RegexId int(11) NOT NULL auto_increment 











) NOT NULL default '0', Regex varchar(255) NOT NULL 





(RegexId)) TYPE=MyISAM" 


); 
dles outside of "while" loop 








my Sregexes inser 


t = Sdbh->prepare ( 











‘insert REGEX 








ES ( RegexId, EnzId, Regex ) values ( NULL, ?, ? )' 





); 
my $lastid = S$db 


# Read in the bio 





h->prepare('select LAST INSERT ID( ) as pk'); 





unless (open (BIONE 
croak ("Cannot 





} 


while (<BIONETFH>) 








net (Rebase) fil 
TFH, S$self->get_bionetfile)) { 
open bionet file " . $self->get_bionetfile); 


{ 
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my @names = ( ); 


Discard header lines 
( 1 .. /Rich Roberts/ ) and next; 


Discard blank lines 
/*\s*S/ and next; 


Split the two (or three if includes parenthesized name) fields 
my @fields = split( "", $_); 





Get and store the recognition site 

my Ssite = pop @fields; 

For the purposes of this exercise, I'll ignore cut sites (%). 
This is not something you'd want to do in general, however! 
Ssite =~ s/\*//g; 








Get and store the name and the recognition site. 
Add alternate (parenthesized) names 

from the middle field, if any 

foreach my Sname (@fields) { 














if(Sname =~ /\(.*\)/) { 
$name =~ s/\((.*)\)/$1/; 

} 

push @names, Sname; 


} 


# Store the data, avoiding duplicates (ignoring * cut sites) 
# and ignoring reverse complements 
foreach my Sname (@names) { 


my Spk; 
my Srow; 


# if enzyme exists 
Senzymes_ select->execute ($name) ; 








if(Srow = Senzymes_ select->fetchrow_arrayref) { 
# get its "pk" 
Spk = $Srow[0]; 
jelse{ 


# Add new enzyme definition 
Senzymes insert->execute ($name) ; 








# Get last autoincremented primary id 
Slastid->execute( ); 
my Spkhash = $lastid->fetchrow_hashref; 
Spk = $pkhash->{pk}; 

} 


# if pk,site exist go to top of loop 

Ssites select->execute($pk, S$site); 

if(Srow = Ssites_select->fetchrow_arrayref) { 
next; 








} 

# and if pk,revcomIUB(site) exist go to top of loop 

Ssitesrevcom_select->execute($pk, revcomIUB(S$site) ); 

if(Srow = Ssitesrevcom_select->fetchrow_arrayref) { 
next; 





} 


# Add new site definition 
# since neither pk,site nor 
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# pk,revcomIUB(site) exists. 
Ssites_insert->execute(S$pk, $site); 





# Add new regex definition 
Sregexes insert->execute ($pk, IUB_to_regexp(S$site) ); 








} 


return 1; 


Ly 


=headl RebaseDB 








Rebase: A simple interface to recognition sites and translations of them into 
regular expressions, from the Restriction Enzyme Database (Rebase) 








=headl Synopsis 


use RebaseDB; 





my Srebase = RebaseDB->new ( 
mysql => 'rebase:localhost', 
bionetfile => 'bionet.212' 
); 
































my Senzyme = 'EcoRI'; 

print "Looking up restriction enzyme Senzyme\n"; 

my @sites = Srebase->get_recognition_sites ($enzyme) ; 

print "Sites are @sites\n"; 

my @res = $rebase->get_ regular expressions ($enzyme) ; 

print "Regular expressions are @res\n"; 

my Senzyme = 'HindIII'; 

print "Looking up restriction enzyme Senzyme\n"; 

my @sites = Srebase->get_recognition_sites ($enzyme) ; 

print "Sites are @sites\n"; 

my @res = $rebase->get_regular_ expressions (S$enzyme) ; 

print "Regular expressions are @res\n"; 

print "Rebase bionet file is ", Srebase->get_bionetfile, "\n"; 
=headl AUTHOR 





James Tisdall 
=headl COPYRIGHT 
Copyright (c) 2003, James Tisdall 


=cut 


6.8.2 testRebaseDB: A Testing Program 


The following test program testRebaseDB is taken from the RebaseDB.pm documentation and 
slightly altered. 
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use lib "/home/tisdall/MasteringPerlBio/development/1lib"; 





use RebaseDB; 
my Srebase = RebaseDB->new ( 
mysql => 'rebase:localhost', 
bionetfile => 'bionet.212' 
\; 
my Senzyme = 'EcoRI'; 
print "Looking up restriction enzyme Senzyme\n"; 
my @sites = Srebase->get_recognition_sites ($enzyme) ; 
print "Sites are @sites\n"; 
my @res = $rebase->get_regular_ expressions (S$enzyme) ; 
print "Regular expressions are @res\n"; 
my Senzyme = 'HindIII'; 
print "Looking up restriction enzyme Senzyme\n"; 
my @sites = Srebase->get_recognition_sites ($enzyme) ; 
print "Sites are @sites\n"; 
my @res = $rebase->get_regular_ expressions ($enzyme) ; 
print "Regular expressions are @res\n"; 
print "Rebase bionet file is ", Srebase->get_bionetfile, 
Here's the output of testRebaseDB: 
Looking up restriction enzyme EcoRI 
Sites are GAATTC 
Regular expressions are GAATTC 
Looking up restriction enzyme HindIII 
Sites are AAGCTT 
Regular expressions are AAGCTT 


















































Rebas 


bionet file is bionet.212 


6.8.3 Analyzing RebaseDB 


Let's take a walk through this RebaseDB.pm class module to see how it uses the DBI relational 
database interface. 


we Aga se 


For starters, the s_ attributes hash has changed to reflect the new relational database 

method. The _rebase key is no longer needed, but the bionetfile is again used to load the 
database; in the absence of this argument, the program attempts to use a previously loaded 
database. 


Two new argument keys are present. mysql gives the database name (e.g., rebase) and 
the user's computer (localhost if the user is running the program on the same computer the 


database is served from). _dbh holds the DBI object returned from the D1 


BI->connect call. 





The new constructor method has changed in significant ways. It checks the arguments a little 
bit differently, of course, because this version of the class has different arguments coming in. 
It then actually connects to the MySQL database using the DBI calls seen previously in this 
chapter. In case of failure, it doesn't carp or die; it prints an error message and returnsa 
much better behavior than just dying. If this were a web script, for instance, you might want 
to keep running so you can return more input from the user in case of failure. 


In case of success, the DBI->connect object reference is saved in the attribute reserved for 
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this purpose, $self->{_dbh}. (dbh stands for "data base handle".) Elsewhere in the module 
two methods are defined: the accessor method, getDBhandle and the mutator method, 


setDBhandle. They get and set this MySQL database object handle in the class attribute 
dbh, 





The program then calls the parse _rebase program. Let's look closely at this method and the 
workhorse that reads Rebase data in from a file and populates the MySQL database with it. 


After retrieving as $dbh the object that points to the database, the method sends several SQL 
statements to the DBMS. These statements delete each of the three tables in this MySQL 
database called rebase, tossing out all their data in the process. The calls then create the 
tables anew (and empty). 


This section of the code also prepares SQL statements that insert new rows with the values 
being passed in as bind arguments and selects various sets of rows. These SQL statements 
are used repeatedly in the while loop that follows, and the program saves time by having the 
SQL statements prepared just once before the loop. 


I won't examine the various SQL statements in detail here. However, I do want to point out 
the use of autoincrementing on the ID field of each table. This option, applicable only to 
single-field keys that serve as the primary key for a table, has the effect of always picking the 
next value for the field, and is perfect for the unique ID field I usually want in a table as the 
primary key. It's just one less thing to worry about in the module. An SQL statement is also 
prepared that uses an SQL function to return the value of the last autoincremented key. 


Now that the new database tables are fresh and empty, the program opens the bionet file 
and prepares to read it within a while loop. The beginning of this loop is unchanged from the 
previous version Rebase.pm because it discards the file header and blank lines and extracts 
the names of the restriction enzymes and the associated recognition sites. 


First, the program checks to see if there is already a definition for the enzyme entered in the 
database; if so, it just retrieves its unique ID, the primary key $pk. If the enzyme hasn't been 
entered, it is now, and the ID that is automatically created for it is saved. 


Next, the program checks if that primary key ID and site are already paired in the SITES table; 
if so, it goes back to the top of the while loop. 


Again, the program checks if that primary key ID and the reverse complement of the site are 
already paired in the SITES table; if so, it goes back to the top of the while loop. 


However, if the ID and the site are still unknown, they are entered into the SITES table, and 
the ID and the regular expression generated from the site are entered into the REGEXES 
table. 


That wraps it up for the parse _rebase method. Because of all the database interactions, this 
is a pretty slow bit of code (see the exercises). 


The only parts of the RebaseDB. pm module code we haven't looked at are the two similar 
methods called get_recognition_ sites and get_regular_expressions. They retrieve all 
recognition sites (or regular expressions) associated with a given enzyme. The SQL 
statement in each of these methods is an example of a join of two tables: 

select Regex from REGEXES, ENZYMES 

where ENZYMES.EnzId = REGEXES.EnzId and ENZYMES .Enzyme=?; 

































































(This is one SQL statement.) The statement asks for the regular expressions from the 
REGEXES table that have the same EnzID as the enzyme given in argument has in the 
ENZYMES table. 





That's the end of my discussion of the RebaseDB.pm class module. See the exercises for 
suggestions on how to improve this module. 
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6.9 Additional Topics 


In this short chapter, I have not mentioned several topics that are of considerable importance 
in database work: 


SQL has only been demonstrated, not fully laid out. It's a fairly simple language; if 
you'll be doing much database work, you should read the language manual that comes 
with your DBMS carefully, and look at several examples. Tutorial books are also 
available for most popular DBMS. 


Transactions are loosely defined as a set of database queries and modifications that 
belong together. For example, you may enter a new gene into your database by 
updating several relevant tables. If your system should experience a failure in the 
middle of such a set of updates, it can leave your database in an ill-defined state. By 
defining transactions, it's possible to avoid such undesirable states, and many DBMS 
now provide support for this view of database updates. 


Entity-relationship modeling is a top-down design methodology with a fairly simple 
graphics representation that signifies relationships between the data in a system. 


Stored procedures are parts of a database application that are performed at the point 
of entry when a user is filling out a form for instance. MySQL is just now starting to 
support them; major DBMS such as Oracle have had them for a long time. They tend 
to improve performance of an application when used well; they also greatly increase 
the difficulty of migrating to a different DBMS if that becomes necessary in the future. 


Object-oriented and object-relational databases are new data models that are finding 
some acceptance. You may come across them on the job, although their use is much 
more limited than the standard relational model. 


4 PREVIOUS || NEXT > 
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6.10 Resources 


The literature on databases is very large, and so the documentation for your particular RDMS 
is essential. For this book, I use MySQL, but many others are suitable. 


Here are a handful of basic textbooks: 


e Database Systems: A Practical Approach to Design, Implementation, and 
Management, Thomas Connolly and Carolyn Begg (Pearson Addison Wesley) 


e A First Course in Database Systems, Jeffrey Ullman (Prentice Hall) 

e An Introduction to Database Systems, C.J. Date (Pearson Addison Wesley) 
e MySQL, Paul DuBois (SAMS) 

e The MySQL Cookbook, Paul DuBois (O'Reilly & Associates) 

e MySQL and Perl for the Web, Paul DuBois (SAMS) 


e Programming the Perl DBI, Alligator Descartes and Tim Bunce (O'Reilly) 
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6.11 Exercises 


Exercise 6.1 


What's the difference between a database and a database management system? 
What's the difference between MySQL and Oracle? 


Exercise 6.2 


What are the limitations of the simple built-in Perl hash database DBM compared to a 
relational database? Its strengths? 


Exercise 6.3 


Is my homologs database, which was placed into second normal form, also in third 
normal form? 


Exercise 6.4 
Write a program that checks if a relation is in second normal form. 
Exercise 6.5 





RebaseDB.pm is written to specify a MySQL database. Rewrite the module so it can 
specify another relational database supported by DBI. 


Exercise 6.6 


The parse _rebase method of the RebaseDB.pm module uses several database queries 
per input line as part of the logic of avoiding duplicate entries for palindromes and 
reverse complements. Compare it with the parse _rebase method in Rebase.pm. 
Make a new parse_rebase method that works for both DBM and DBI database 
storage and will improve the efficiency of the DBI method by minimizing database 
queries. 


Exercise 6.7 





RebaseDB.pm is a port of the earlier version Rebase.pm that used the DBM hash 
database. Make a version of this module that handles both DBM and MySQL databases 
depending on the arguments passed to the new method. 


Exercise 6.8 


Compare MySQL with some other relational database management system; include 
such management issues as cost of purchase, cost of maintenance, availability of 
skilled personnel, stability of vendor in the marketplace, and customer support. 


Exercise 6.9 


Given the many possible alternate forms of a small set of simple relations, would you 
say that the relational model is not specific enough? What constraints might you add 
to improve the quality and reduce the number of possible relational designs? 


Exercise 6.10 


Name two bioinformatics problems that are ill-served by the data structures of a 
relational database. 


Exercise 6.11 
Implement a relational database that supports a project in your wet lab. 
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Chapter 7. Perl and the Web 


One of the basic skills of a programmer is designing and putting up an interactive web site. 
This is as true for bioinformaticians as it is for other programmers. 


Not every bioinformatician will need to implement a web site. Your work in bioinformatics may 
specialize early; perhaps you work in analyzing algorithms for representing gene cascades, 
while web programming is someone else's responsibility. 


However, the Web has become the principal way to provide programs to users. This is 
certainly true in biology, where it is typical for laboratories to provide programs and access to 
data via the Web. Collaborations can be promoted between research groups, the need for 
publication of results can be addressed (witness the many peer-reviewed journal articles that 
invite readers to visit web sites for supporting information), and valuable research tools can 
be widely disseminated. 


This chapter provides a quick introduction to the way web pages and the Internet work, 
followed by a closer look at some important parts of web programming. You'll see these 
techniques applied to create an interactive web page that accepts a sequence and enzyme 
names and returns a restriction map. 


Along with the remarkable growth in use of the Web and the Internet have come an equal 
proliferation of languages, systems, and tools for web programming. There are now a variety 
of choices in how a programmer can create interactive web pages. 


We will use one such system that employs the Perl language. To make the web pages 
interactiveto enable users to type in queries and get responseswe'll use the popular CGI.pm 
Perl module that's shipped with all recent versions of Perl. CGI stands for the Common 
Gateway Interface, a protocol that provides dynamic web contentweb pages that change for 
various reasons (such as a user asking for a restriction map)as opposed to static, unchanging 
web pages. Perl and CGI were an early success story in programming the Web and remain 
widely available as a standard web programming technique. They constitute a basic skill set in 
web programming, despite the many alternatives now available. 


Because bioinformatics programmers need at least a basic working knowledge of web 
programming, I'll teach you the skills necessary to put an interactive web page on the Web; 
the example is based on a continuation of the restriction map example from previous 
chapters. 


To begin, I'll give a brief run down of the chief components of web technology to give you the 
basic overview and terminology you'll need to do web programming and to read further in the 
field. If you're already familiar with servers, browsers, HTML, and the other components of 
web programming, feel free to skip ahead. 
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7.1 How the Web Works 


The Internet is short for "interconnected networks." It is a set of conventionsprotocolswith 
which computers and networks can intercommunicate. Its development from earlier work 
before 1980 allowed many different networks to join and users on many computers to 
communicate. This communication was originally done in several ways, such as by email, 
electronic mail, and by FTP, file transfer protocol. These methods remain very popular and 
widely used. 


It wasn't until the early 1990s that the World Wide Web or Web was born as a new Internet 
service. The Web was based on the new hypertext transport protocol or HTTP, and the first 
software to use it was in the form of programs called web browsers and web servers. Web 
browsers are programs that handle user requests and display results to the user; the most 
widely known web browsers include Internet Explorer and Netscape. Web servers are 
programs that accept requests from web browsers and send results back to them for display; 
Apache is the most widely used web server. With the development of web browsers and their 
ability to handle images as well as text, these new protocols sparked intense popular interest 
in computers. At the same time, computer costs were falling steadily, and their capabilities 
were growing, which made the new web protocol even more widespread. 


The Web has become critical to scientific programming; in fact, it started there. The Web and 
its associated protocols such as HTTP were originally developed at a high-energy physics 

laboratory in Switzerland, CERN, and they have been heavily used in the sciences ever since. 
In biology, as elsewhere, the Web has become one of the principal means of communication. 


7.1.1 URLs 


The Web is essentially a two-part system of browsers and servers, in which browsers get 
results from servers and display them for the user. This type of architecture is called a 
client-server design, in which the client (web browser) requests service from the server (web 
server). Web browsers and servers are just programs that run on computers. They may both 
be on the same computer, or, thanks to the Internet, they may be on opposite ends of the 
earth. 


In order for this scheme to work, the web browser has to be able to send its request to the 
web server. For instance, say you want to see the New York Times from your Internet 
Explorer web browser (or your Netscape, Mozilla, or other web browser). You have to know 
the location of the New York Times on the Web and type it into the space provided in your 
browser screen. 


So, you type http: //www.nytimes.com, hit the Return or Enter key on your keyboard, and 
the next thing you know you're reading the latest articles about human cloning and 
double-stranded RNA. How does this work, exactly? 


The answer is really very simple. The web browser sends your request to the Internet; the 
actual location of the desired computer is determined, and your request is sent to the web 
server program on that computer. The web server handles your request and sends back a 
web page your browser then displays. This web page may include other URLs of specific 
articles. You can click on one, and the whole process is repeated, but this time your request is 
for a specific article, which is then returned to your computer and displayed by your web 
browser. 


Behind this simple overall architecture are several steps. A basic familiarity with some of these 
steps and the associated terminology is needed in order to learn the fundamentals of web 
programming. 


The location you typed in, http://www.nytimes.com, is called a Uniform Resource Locator or 
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URL. The Internet (to which you must be connected for this to work, of course) takes the 
URL you typed in and, with the help of a network of computers and their routing tables that 
are configured for this task, resolves the URL into an Internet address (IP address), which is 
numeric. The address you typed in has a vague resemblance to English; www.nytimes.com is 
translated by routing tables into the numeric IP address. 


The details of this are not important; if you know it, you can also just type in the numeric 
Internet address instead of the domain name. The advantage of this design is that it allows 
the routing tables maintained by the Internet to take a domain name and translate it to the 
correct actual Internet address. Then, if the New York Times changes its main computer, or 
decides to move to Paris,” all that needs to be done is for the routing tables to be updated 
with the new actual Internet address. You can still type www.nytimes.com and get to the 
proper web server without worrying about where it actually lives on the Internet. 


"1 The Paris Review moved from Paris to New York, after all. 


A URL can have several parts: 


e It begins with a scheme, which is "http" in this case and specifies the protocol for the 
request. "http" is the most common scheme; others include "https" for increased 
security, "ftp" for file transfer protocol, and so on. 


e Acolon and two forward slashes (://) separates the scheme from the hostname, 
which is www.nytimes.com. This is the part that's resolved by the routing tables on the 
Internet; it gives the address of the computer with which you actually want to 
communicate. 


e Following the hostname, several other bits of information may optionally appear in the 
URL, such as a particular location on the server's computer, a port number, some 
parameters to pass to the server, and other details that tell the server exactly what 
the browser is requesting. 


For instance, say you want to use the web page the reporters for the New York Times 
use to organize their list of helpful web sites. You'd type in 

http: //www.nytimes.com/navigator. Now, following the hostname is the additional 
information /navigator. This is a pathname for the particular web page you're 
interested in. It's just like a pathname of a file or directory on your filesystem. 
Sometimes it is longer, such as 

http://www. nytimes.com/library/tech/reference/cynavi.html, and includes several 
directories and subdirectories, and finally a filename (in this case, it's cynavi.htm/). 


These path names are relative to the way the web server is installed and configured; 
they tell the web server exactly what resource is being requested by the browser. 


e Following the pathname may come other information. If the pathname is the name of 
a CGI program (discussed later in this chapter), certain arguments may be sent to that 
program; of course, these vary depending on the particular CGI program involved. The 
arguments or queries are separated by question marks and give desired values to 
parameters. A typical example might be 
http://www. mycomputer.com/cgi/rebase.cgi?enzyme=EcoRI ?enzyme=HinDIII. This 
requests a web page from the web server on the computer www.mycomputer.com. 
The web page to be returned is generated on the fly by the CGI script on that 
computer in the file cgi/rebase.cgi. The URL also passes that CGI script the names of 
two enzymes (which the script will presumably use to formulate its reply), EcoRI and 
HinDIII. 


Other information may appear in a URL, and other variations are possible. As one more 
example, if you had a web page saved on your computer in the file 
/home/tisdall/arabidopsis.html, you can display it by typing the following into a web browser 
running on the same computer: file:/home/tisdall/arabidopsis.html. 
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If you have to manipulate URLs in your program (and you very well may at some point), 
there is a collection of modules available on CPAN called URI: : URL that will make your life a 
whole lot easier. 


7.1.2 HTML 


The Hypertext Markup Language (HTML) is the language that embellishes text so that it can 
be displayed in a web browser. 


There are two important parts of HTML. It formats text, specifying such things as paragraphs, 
italics, numbered section headings, and the like. Although text is the most common type of 
information displayed, other types of information such as images and sound are also 
commonly incorporated into a document. 


The other important part of HTML is that it incorporates hypertext links, which make a 
document interactive by providing the user viewing the document in a web browser the ability 
to click on links and go to other web pages. 


The basic idea of HTML is to embed within a document directions for how to display the 
document. The directions are rather vague, compared to real typesetting tools such as 
FrameMaker or Quark. HTML commands may be interpreted differently by different web 
browsers so that your HTML document can look considerably different when viewed by 
different people. This limitation was a deliberate part of the design of HTML and web 
browsers. The disadvantage of not being able to exactly specify how a web page appears is 
offset by the advantages of the simplicity of HTML and the possibility to view HTML 
documents on a variety of computers and operating systems. 


7.1.2.1 HTML web page example 


To demonstrate, let's see a short example of an HTML web page: 


7] The Rebase web page that I'll develop in this chapter will give you a more complete example. Most web 


browsers allow you to see the HTML for whatever web page you're viewing by clicking on the Page Source link 
in the View menu of the web browser. (Your browser may use slightly different names, but all the major web 
browsers enable you to look at the HTML source by selecting a menu item.) 

<html> 

<head> 

<title>Double stranded RNA can regulate genes</title> 

</head> 

<body> 

<h2>Double stranded RNA can regulate genes</h2> 

<p>A recent article in <b>Nature</b> describes the important 

discovery of <i>RNA interference</i>, the action of snippets 

of double-stranded RNA in suppressing gene expression. 

</p> 

<p> 

The discovery has provided a powerful new tool in investigating 

gene function, and has raised many questions about the 

nature of gene regulation in a wide variety of organisms. 

</p> 

</body> 

</html> 














This HTML, if contained in a file, can be displayed in a web browser. If the file is on the same 
computer as the web browser you're using, you can display it easily. If it's on a different 
computer, the file has to be in a place your computer's web server has been configured to 
look. 


For instance, if a file on your computer /home/tisdall/htmlexample1.htm/ contained the 
previous HTML content, you can type the URL into your web browser as so: 


file: /home/tisdall/htmlexamplel.html 
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and the browser would display something like that in Figure 7-1. 


Figure 7-1. HTML example 


FJ Dousle svandec RNA cen regulate genes - Mozilla (Bulle ID; 200203132} 
" File Eom View Search Go Bookmarks Tasks Help Deoue QA 


: 2 = 
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Back Reload Prat 


Double stranded RNA can regulate genes 


A recent article m Nature describes the important discovery of RNA interference, the acton of snippets of 
double-stranded RN A in suppressing gene expression. 


The discovery has gravided s powerful new toal in investigating gene function, and has raised many questions abaut 
the nature of pene regulation in a wide variety cf organisms 


Document Done (0.14 secs) 








I say "something like this" because many of the details of exactly how the text and layout 
appears are left to the browser program. The browser program may be set to use different 
font sizes or font types, break the lines at different places, display a different colored 
background, and, in general, specify locally several of the formatting options for the web page 
to be displayed. For example, the browser window may be very small, in which case the text 
will be reformatted to fit as well as possible into the available window size. Still, the basic 
content of the text should appear similarly to what is shown in Figure 7-1. 


7.1.2.2 HTML directives 
Let's take a look at how the HTML directives are embedded into the document. 


HTML directives are mostly specified by enclosing them in angle brackets. The directives come 
in pairs, and the text between the opening and closing directive is affected. The second 
member of a pair has an added forward slash / before the tag name. 


So, for example, to make a word italicized, you surround the word with the <i> and the </i> 
pair of tags. In the previous example, the term "RNA interference" is surrounded in this 
fashion, so it appears in italics in the browser. 


The pair of tags <html> and </html> surrounds the entire document, and serves to delimit 
the HTML content for the web browser (or other HTML-reading program). 


HTML documents have two major sections: the head and the body. The pair of tags <head> 
and </head> surrounds text that is related to the document as a whole. In Figure 7-1, there 
is only one item in the head sectiona "title" that is displayed in the titlebar of the web browser. 
The title tags <title> and </title> surround the title "Double stranded RNA can regulate 
genes". 


The head section can contain many different kinds of directives that influence the display of a 
document. It is followed by the "body" of the document which is surrounded by the tags 
<body> and </body> and comprises the rest of the document. 


The body, in this simple example, has a header, paragraphs, and a few formatting directives, 
and it is surrounded by the tags <body> and </body>. 


The headers can be of different levels, so you can make a document structure with primary 
headers and various subsections. This example specifies just a single header as follows: 


<h2>Double stranded RNA can regulate genes</h2> 





The first paragraph makes the journal name "Nature" appear in bold font, and the new term 
"RNA interference" appears in italics: 


<p>A recent article in <b>Nature</b> describes the important 
discovery of <i>RNA interference</i>, the action of snippets 
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of double-stranded RNA in suppressing gene expression. 
</p> 


Notice how the paragraph tags <p> and </p> surround a paragraph. (Actually, the closing 
paragraph tag can be omitted as a time-saving convenience; some very common tags have 
this feature, but most do not.) 


The second paragraph contains only text: 

<p> 

The discovery has provided a powerful new tool in investigating 
gene function, and has raised many questions about the 

nature of gene regulation in a wide variety of organisms. 

</p> 





The following summary of the document highlights the major sections and omits the details 
within the head and the body: 


<html> 


<head> 
. header information goes here 
</head> 


<body> 
the body of the document goes here 
</body> 





</html> 


That's all there is to say about this simple example. Other features of HTML include embedded 
hyperlinks to web pages, email, and so forth. HTML has expanded in several ways over the 
last few years, and many more types of formatting are possible. 


7.1.3 HTTP 


The Web is based on a language called the Hypertext Transport Protocol, or HTTP. HTTP is the 
protocol that communicates between web browsers and servers. 


Recall that in this chapter I'm using CGI to handle the communication between browsers and 
servers, and CGI can be thought of as a simplified interface to HTTP. So, it's necessary and 
useful to learn a few basic facts about HTTP before embarking on CGI programming. 


HTTP works in a simple fashion. The browser sends a request which is made of a header and, 
often, a body. The server receives the request and sends a response, which is also made of a 
header and, sometimes, a body. 


The first line of a request header is called the request line and contains the request method. 
The request method is usually GET. This is the most common request, and it asks for a 
specific resource from the web server, usually specified as a URL to be retrieved by a specific 
protocol such as HTTP. 


The remaining lines of the request header are called header fields and consist of name-value 
pairs, which include such items as the hostname the request is being sent from. 


The reply message also has a special first line called the status /ine which reports the protocol, 
a numeric code representing the specific response, and a text version of the response ("OK"). 


The remaining lines of the header contain name-value pairs of various other parameters. For 
instance, the name and version of the web server may be specified. 


After the reply header may come the body of the response. This is always separated from the 
reply header by a blank line (actually a carriage return and line feed). In this case, the body of 
the reply is exactly the HTML code for the simple web page concerning RNA interference 
shown earlier. 
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There are many name-value pairs I have not mentioned and many other details that can have 
significance within this basically simple HTTP protocol scheme. However, this overview gives 
you the basic idea and the essential structure of the protocol that is exchanged between the 
web browser and the server. 

[ Team LiB ] 
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7.2 Web Servers and Browsers 


The whole architecture of the Web is based on the client-server, request-response protocol of 
HTTP. This colors everything that you do when you create interactive web pages. 


Web browsers come in many flavors from many companies (or open source programming 
groups) and each has its own idiosyncrasies. Here, I'll stick to the basics, and the code should 
display okay on any standard web browser. 


As you go forward, keep in mind the problem of incompatibility among web browsers and 
realize that at some point, it may become necessary for you to deal with the problem in the 
code you write for the Web. 
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7.3 The Common Gateway Interface 


The preceding sections of this chapter have presented a brief overview of the design of the 
Web, the principal components of the programming environment such as HTTP, HTML, and 
URLs, and of the essential request-response nature of web interactions between web 
browsers and web servers. Now, it's time to look at CGI and a specific Perl module, CGI.pm, 
that is widely used to create interactive web pages on servers. 


CGI, the Common Gateway Interface, is an interface between a web server and some other 
program that requests such web content as HTML documents or images. 


A web browser may request from the web server the output from a CGI program. In this 
case, the web server finds the program or script, runs the program, and sends the output of 
the program back to the web browser. The output of the program may be HTML, just as it 
may be found in a static file, but it is created by the CGI script dynamically, so it may be 
different each time; for instance, it may include the time of day in its display. 


In other words, a CGI script is just a program that produces web content that can be 
displayed in a web browser. It also can read information passed to it by the web server, 
usually parameters filled out by the user in a form displayed on a web browser that asks for 
the specifics of a query. For example, the parameters might be the name of a sequence file 
and the name of a restriction enzyme to map in the sequence. The CGI program takes the 
parameters, runs, and outputs a dynamically created web page to be returned to the user's 
web browser. 


A CGI program can be written in just about any language, but the most common for CGI 
programs on the Web is Perl. Perl has a very nice module called CGI.pm that eases the task of 
writing CGI scripts; it's a popular way to create dynamic web sites with a minimum of bother. 


7.3.1 Writing a CGI Program 


So, how do you write a CGI Perl program? Basically, you write a Perl program that includes 
the line: 


use CGI; 


You then use the CGI.pm methods so that your Perl program outputs the code for the web 
page you want to return. After that, it's simply a matter of placing your new CGI script in the 
proper place in the web server's directory structurenamely, in a directory that the web server 
knows is supposed to contain CGI scripts. Finally, you type in the name of the CGI script to a 
web browser as a URL. The web browser sends the request to the web server, which 
executes the CGI script, collects its output, and returns it to your web browser to be 
displayed as a web page (or image, sound, or whatever). 


You can actually write a Perl program that just prints out HTML code (without ever using the 
CGI.pm module) and install that as a CGI program. For instance, you can take the HTML page 
shown earlier and create a CGI Perl script that dynamically outputs that page. To prove it's 
dynamic, I'll add a little code that includes the time of day: 


#!/usr/bin/perl 


use strict; 
use warnings; 


my Stime = localtime; 


print "Content-type: text/html\n\n"; 





print <<EndOfHTML; 
<html> 
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<head> 

<title>Double stranded RNA can regulate genes</title> 
</head> 

<body> 

<h2>Double stranded RNA can regulate genes</h2> 

<p>A recent article in <b>Nature</b> describes the important 
discovery of <i>RNA interference</i>, the action of snippets 
of double-stranded RNA in suppressing gene expression. 

</p> 

<p> 

The discovery has provided a powerful new tool in investigating 
gene function, and has raised many questions about the 
nature of gene regulation in a wide variety of organisms. 

a oe 

<p> 

This page was created $time. 

</p> 

</body> 

</html> 

EndOfHTML 








Notice that the program just prints out the HTML code. It also prints a header line: print 
"Content-type: text/html\n\n"; before printing the HTML code as the body of the 
response. Notice the two \n's in that header line; these print a blank line between the header 
and the body of the response, as described earlier in this chapter. 


Also, notice a new last paragraph that reports the time. 
7.3.2 Installing a CGI Program 


After writing the program, it is necessary to install it in the cgi script directory of your web 
server. Because of the multiplicity of web servers and operating systems, it is not possible for 
me to be comprehensive on this point. On my Linux system, using the Apache web server, I 
simply became superuser (root), copied the script (called cgiex1) into the directory 
/var/www/cgi-bin, and then typed: 

chmod 755 /var/www/cgi-bin/cgiexl 


If you're working on a Mac OS X, the procedure is similar. If you're on a Microsoft Windows 
machine, the details are a little different; consult the documentation for your web server to 
see how to install a CGI script in the appropriate place. 


Once I've installed the script, I simply entered the following URL into my web browser. Notice 
that the URL gives the hostname as localhost, which means the web server is on the same 
computer on which I'm using the web browser. 


http: //localhost/cgi-bin/cgiexl 


I hit the Enter or Return key, and the web server returned the web page that's displayed in 
my web browser (see Figure 7-2). 


Figure 7-2. Web page for cgiex1 
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Notice that it's exactly the same as the previous version that just read a file, but this time, 
the current date on the web server is also being reported. So each time you run this program, 
you'll get a different output (as regards the date, that is). This qualifies the program as 
dynamic. 


I'll go into more detail about CGI installation in the next section. 
7.3.3 Using the CGl.pm Module 


The following program was written using CGI.pm; it has the same output as the example in 
the previous section. Notice how almost the entire contents of this CGI script are a Perl print 
function with a list of arguments, ending with the argument end_html. The various arguments 
to print are either CGI.pm functions or text strings: 


#!/usr/bin/perl 


use strict; 
use warnings; 


use CGI qw/:standard/; 
my Stime = localtime; 


print 
header, 

start _html('Double stranded RNA can regulate genes'), 

h2('Double stranded RNA can regulate genes'), 

start form, 

Pr 

"A recent article in <b>Nature</b> describes the important 
discovery of <i>RNA interference</i>, the action of snippets 

of double-stranded RNA in suppressing gene expression.", 

Pr 

"The discovery has provided a powerful new tool in investigating 
gene function, and has raised many questions about the 

nature of gene regulation in a wide variety of organisms.", 

Pr 

"This page was created Stime.", 

Pr 

end_form; 

















This program uses the most common routines defined in the CGI.pm module, as imported 
into the program's namespace by the directive use CGI qw/:standard/;. 


The function header prints the header information discussed earlier in this chapter; it takes as 
an argument the document type and assumes the type is text/html by default. The function 
start html starts the HTML and gives the title of the document (which most web browsers 
display in their titlebar above the document). The functions h1, h2, and so forth give the 
different levels of HTML headers in the document structure. The function p starts a new 
paragraph of text. Finally, the function end_form closes the HTML document. 


Here is the body of the document the web browser receives for display from the CGI 
program cgiexl.cgi: 


<html> 

<head> 

<title>Double stranded RNA can regulate genes</title> 
</head> 

<body> 

<h2>Double stranded RNA can regulate genes</h2> 

<p>A recent article in <b>Nature</b> describes the important 
discovery of <i>RNA interference</i>, the action of snippets 
of double-stranded RNA in suppressing gene expression. 
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<> 

The discovery has provided a powerful new tool in investigating 
gene function, and has raised many questions about the 

nature of gene regulation in a wide variety of organisms. 


</p> 

<p> 

This page was created Tue Apr 15 09:42:49 2003. 
</p> 

</body> 

</html> 


This simple web page is not much less complicated than the previous version cgiex1 that 
didn't use CGI.pm but simply output the HTML code. However, as you write more complicated 
web pages, with forms to be filled in by the user, choices to click on, and a button to push to 
submit a request, you'll see that using CGI.pm can significantly ease your programming work. 


7.3.4 Testing a CGI Program 


Let's assume the CGI scripts are where they should be, ownership and permissions have been 
assigned, and now your web server can find and attempt to execute them. So how do you 
test a CGI web program? 


First, check the basic syntax by running: 

perl -c cgiexl.cgi 

and, hopefully, getting the message: 

cgiexl.cgi syntax OK 

If not, you can save a bit of trouble by at least fixing the syntax of your program before 


installing it in the web space, where you have to test it with the web browser and web logs 
and so forth, as I'll demonstrate in a moment. 


Copy your CGI program cgiex1.cgi into your CGI directory (on my system, it's 
/var/www/cgi-bin). I did this as the user root. Then, still as root, I made the program 
executable by typing: 

chmod 755 /var/www/cgi-bin/cgiexl.cgi 


Let's try the program out. Start up a web browser and type in the URL 
http://ocalhost/cgi-bin/cgiex1.cgi and hit the Enter key. Figure 7-3 shows what it looks like. 


Figure 7-3. The results of a successful CGI program 
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It worked! But what would you do if it doesn't? Even though you check the syntax, the 
program may have had some other problem that caused it to fail. For this demonstration, I 
made another version of the program with a missing semicolon near the beginning of the 
program, called cgiexlouch.cgi. When I ran it, I saw something like what's in Figure 7-4. 
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Figure 7-4. When things go wrong 
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The best way to proceed is to check the error logs for the web server to see if they give any 
useful hints as to why the program failed. 


Again, this will be different on different systems. On Linux, Unix, and Mac OS X, some 
variation of the following will work. I determined where the error logs for the web server are 
kept on my system, which is in the directory /etc/httpd/logs, and the most recent error log 
there is called error_log. I opened a new command window and typed the following 
command: 


tail +0f£ /etc/httpd/logs/error log 


This prints out the error file, and then waits at the end. Whenever anything new is printed to 
the error file, it prints it out too. So, by hitting the Return key a few times to make space 
after what went before, and then trying to run the program again from the web browser, I 
can see clearly what new error messages resulted. 


Here's what I saw in this example: 


syntax error at /var/www/cgi-bin/cgiexlouch.cgi line 6, near "use warnings 


use: CGL ™ 

Execution of /var/www/cgi-bin/cgiexlouch.cgi aborted due to compilation 
errors. 

[Tue Apr 15 21:23:10 2003] [error] [client 127.0.0.1] Premature end of script 
headers: 

/var/www/cgi-bin/cgiexlouch.cgi 





Sure enough, I'd removed the semicolon at the end of the use warnings statement. 


For most web programming jobs, this is all you'll need, because the error log will show you 
the error output of the program. If it's a difficult problem, you can even put extra print 
statements in your CGI scriptin the standard way they are used to debug a misbehaving 
program. For instance, you might see if a program ever gets to a certain line by placing 
directly after that line the statement: 


print STDERR "Got to here!\n"; 





This message will appear in the error logs (if your program gets to that point before it dies). 
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7.4 Rebase: Building Dynamic Web Pages 


The simple examples in the previous sections showed how to load and use the CGI.pm 
module to display a very simple web page and how to examine the error logs of the web 
server to help debug a CGI program that doesn't display properly. 


The real power of CGI comes from its ability to provide dynamic contentweb pages that may 
display different information depending on such factors as when they're called, such as the 
date and time in the previous example. Dynamic content also handles the requests of users 
that are entered by typing in text fields, clicking on so-called "radio" buttons, selecting from 
lists, or other ways of inputting. 


In this section, I'll show you how to use some of the modules from previous chapters, 
combined with the use of the CGI.pm module, to make an interactive, dynamic web page for 
displaying restriction maps. In this web page, the user will select which restriction enzyme or 
enzymes to search for and specify the sequence to search either by entering the sequence 
data into a text window or by browsing for the file that contains the sequence. 


Here is the short CGI program, webrebasel, that accomplishes this. The main reason that it's 
short is because I've already developed modules for reading sequence files, for accessing the 
Rebase database, for calculating restriction maps, and for displaying the maps with simple 
text graphics. I can just reuse those modules here to accomplish my task: 


#!/usr/bin/perl 


webrebasel - a web interface to the Rebase modules 





To install in web, make a directory to hold your Perl modules in web space 
use lib "/var/www/html/re"; 


use Restrictionmap; 
use Rebase; 

use SeqFilelI0O; 

use CGI qw/:standard/; 


use strict; 
use warnings; 








print header, 
start _html('Restriction Maps on the Web'), 
hl('<font color=orange>Restriction Maps on the Web</font>'), 
ne, 
start _multipart form, 
"<font color=blue>', 
h3("1) Restriction enzyme(s)? "), 
textfield('enzyme'), p, 
h3("2) Sequence filename (fasta or raw format): "), 
filefield(-name=>'fileseq', 
-default=>'starting value', 
-size=>50, 
-maxlength=>200, 
)r Py 
strong(em("or")), 
h3("Type sequence: "), 
textarea ( 
-name=>'typedseq', 
-rows=>10, 
-columns=>60, 
-maxlength=>1000, 
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), Py 
h3("3) Make restriction map:"), 


submit, p, 
</ FOmeES" > 
he; 

end_ form; 


if (param( )) { 


my Ssequence = ''; 


# must have exactly one of the two sequence input methods specified 


if (param('typedseq') and param('fileseq')) { 
print "<font color=red>You have given a file AND typed in sequence: 
do only one!</font>", hr; 
exit; 
}elsif (not param('typedsegq') and not param('fileseq')) { 
print "<font color=red>You must give a sequence file OR type in 
sequence! </ 
font>",. hr? 





exit; 
}elsif (param('typedseq')) { 
Ssequence = param('typedseq'); 
sJelsif (param('fileseq')) { 


my Sfh = upload('fileseq'); 

while (<Sfh>) { 
/*\s*>/ and next; # handles fasta file headers 
Ssequence .= $ ; 


} 


# strip out non-sequence characters 


Ssequence =~ s/\s//g; 
Ssequence = uc S$sequence; 


my Srebase = Rebase->new ( 
fomit "bionetfile" attribute to avoid recalculating the DBM file 
domfile => 'BIONET', 
mode => '0444', 











i 


my Srestrict = Restrictionmap->new ( 
enzyme => param('enzyme'), 
rebase => Srebase, 
sequence => S$sequence, 
graphictype => 'text', 








ez 





print "Your requested enzyme(s): ",em(param('enzyme')),p, 
"<code><pre>\n"; 
(my Sparamenzyme = param('enzyme')) =~ s/,/ /g; 
foreach my Senzyme (split(" ", Sparamenzyme)) { 
print "Locations for Senzyme: ", 
join(' ', Srestrict->get_enzyme map (Senzyme)), "\n"; 


} 

print. “\ni\n\n"; 

print S$restrict->get_graphic, 
"</pre></code>\n", 

linear 
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print end html; 
7.4.1 Installing webrebase1 


Installing webrebasel is almost exactly the same as installing the scripts seen earlier in this 
chapter, such as cgiexl.cgi. However, because this program depends on several modules, 
it is necessary to copy them into a directory that can be found by the web server. It's 
possible to configure a web server to look into any directory; you may prefer to leave your 
modules in one place and give the necessary permissions to your web server to look there. If 
your code lives in one place, there won't be any problem with out-of-sync duplicate copies. 


However, there are security problems associated with letting the world execute programs 
from your own directories. Also, if you try out a change that doesn't work while you're 
tinkering with your code, any users on the web site will find the programs broken as well. So, 
often it does make sense to have a development area and a production area where you try 
to ensure that only working, tested, and secure programs are placed for public consumption. 


On my Red Hat Linux system, I created a directory /var/www/html/re and copied the 
modules Restrictionmap.pm, Rebase.pm, and SeqFileIO.pm there. On your system and 
web server, you may need to check that the existence, ownership, and permissions on that 
directory and those module files are suitable for your web server's configuration. 


I then copied my CGI program webrebasel into my CGI directory (on my system, 
/var/www/cgi-bin), and dealt with the same questions of ownership and file permissions as 
detailed in earlier sections of this chapter. (As they say down at the dealership, your mileage 
may vary depending on the operating system and web server that you are using.) 


It's possible that the first line that invokes the Perl application, #! /usr/bin/perl, may have 
to be changed to run from your web space. Sometimes a web server is configured to restrict 
calling any programs from outside the web space, and a Perl application must be installed into 
the web space (usually with such extra security precautions as taint checks compiled into the 
application). If you plan to offer programs to the world from a computer that has sensitive 
information or is connected to other computers that have sensitive information, such 
precautions are often desirable. But just to get started, try using the same Perl application 
you've been using, and it's likely to work. 


One consequence of running the Restrict.pm module is that the DBM file called bionet is 
created in your web server's CGI directory (/var/www/cgi-bin on my Linux system running an 
Apache web server). So, another thing to check is whether you have enough space for any 
files your programs may create in your web space. 


7.4.2 Inside webrebase' 


webrebasel first loads the required modules: the ones you've written and the standard 
CGI.pm module. 


The program has two parts. The first part is always executed, and displays the form that asks 
the user to enter the required information to run the program. The second part executes only 
when the program is called with parameters set, which happens after the user has filled out 
the form and hit the Submit Query button. 


Let's look at the first part of the code that creates the form. Everything in this part of the 
form is one long print statement. The list of things to print is composed mostly of calls to 
various CGI.pm functions. For details on these functions, take a look at the CGI.pm 
documentation on www.perldoc.org or by typing perldoc CGI at a command prompt. 





Here are the CGI functions called but not seen in the earlier programs: 
h1l('<font color=orange>Restriction Maps on the Web</font>') 


This is a header, as seen previously; however, it includes a color directive for the font. 


start_multipart form 


Page 263 


Makes the part of the form that handles file uploading work correctly. 


Draws a horizontal line across the screen. 
‘<font color=blue> 

Makes everything in the form blue, up to the closing directive '</font>'. 
h3("1) Restriction enzyme(s)? ") 


This header, like similar headers in the form, labels the following textfield so the user 
knows what information is requested. 


textfield('enzyme') 





textfield creates a place for the user to type in a line of text. The string the user 
types in is accessible by means of the parameter named enzyme when the form is 
submitted. The user can type in the names of enzymes such as EcoRI and HindIII to 
find more information. 


This starts a new paragraph in the form. 


filefield(-name=>'fileseq', -default=>'starting value', -size=>50, 
-maxlength=>200, ) 





filefield provides a way to give the name of a file that contains a sequence. When 
the form is submitted, that file is uploaded from the user's computer onto your 
computer where it can be used to find the restriction map. As you can see in Figure 
7-5, the user can type in the pathname of a sequence file or use a mouse to 
interactively browse until the desired sequence file is found. 


The option name gives the name of the parameter that has the contents of the file; 
size and maxlength are the size of the field displayed and the maximum length of the 
filename. 


Figure 7-5. Rebaseil in a browser window 


Page 264 
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Type sequence: 
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Submit Query 
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strong(em("or") ) 
This prints the word or in some strong fashion (usually in a bold font, but at the 
discretion of the user's web browser) and with some emphasis (usually in italics). 
textarea( -name=>'typedseq', -rows=>10, -columns=>60, -maxlength=>1000,) 
textarea provides a box in which the user can type or use the mouse to cut and 


paste the sequence directly, as opposed to giving the name of a file that contains the 
sequence. 


submit 
This button collects the values the user has given on the form into the named 
parameters and restarts the program by submitting the form to the web server, this 
time with the parameters set. webrebasel has a section that uses the parameters to 
perform a computation, as you will see shortly. 


end form 
This closes the start_multipart_form given earlier. 
end_ html 


This CGI function is called at the end of the webrebasel program, and it prints the final 
required HTML tags for the page before sending it back to the user's web browser. 


When the user hits the Submit button, the parameters are assigned the values the user has 
indicated, and the program is called again. This time, after printing the form, the program gets 
to the conditional block beginning: 


if (param( )) { 
The param( ) CGI function returns a true value if parameters have been sent to the program, 


so at this point the block is entered. The block does some error checking, extracts the 
information from the parameters, computes the restriction map, and displays the results. 


The error checking ensures that all the data needed from the parameters for the computation 
to proceed is present. 
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Assuming the parameters have been set correctly, the program gets the sequence to be 
mapped from either the typed-in textbox field: 


elsif (param('typedseg')) { 
Ssequence = param('typedseq'); 
or from the uploaded file: 
}elsif (param('fileseg')) { 
my Sfh = upload('fileseq'); 
while (<Sfh>) { 
/*\s*>/ and next; # handles fasta file headers 
Ssequence .= $ ; 


} 


As you can see, the uploaded file is provided as an opened filehandle to your webrebasel 
program. The while loop assumes that the file is in FASTA format (see the exercises for this 
chapter), skips the header, and collects the sequence. 


After cleaning up the sequence by stripping out newlines and making it uppercase, the 
program then calls the Restrict and Restrictionmap modules to calculate the restriction 
map with the requested enzymes as available by means of the param('enzyme') CGI 
function call. 


Finally, the program is ready to display the results. As you can see in Figure 7-6, the results 
appear after the form. First, webrebasel prints out the names of the requested enzymes: 





print "Your requested enzyme(s): ",em(param('enzyme')),p, 


Figure 7-6. Results of the Rebasel1 query 
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EcoRI 
SCOCOCATCTACTGOCATCTGCOGAATTCTOOACATCAACTOCTTCATCAT 


COOESTETGACAACTOCAATGASTOOTTCCATGOSGACTOCATOCOGATOA 


CTCAGAAGATOGOCAAGOCCATOCOGCAGTOGTACTSTOOGCAGTSCAGA 


te Ls 2 (2) @ Document Done (0.489 sa. 








Recall that the trailing ,p, is a CGI directive to start a new paragraph; the em( ) CGI function 
asks the user's web browser to emphasis the text, probably by italics. 


The next line is not part of CGI but is an HTML directive that ensures the following lines are 
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printed in a fixed-width font. Every character will thus take the same amount of horizontal 
space, and all lines and spaces will line up just as they do when they're printed to your screen. 
Without this directive, the web browser could use a non-fixed font, and the map would 
improperly display: 


"<code><pre>\n"; 


Of course, as you now know about HTML tags, they are almost always required to appear in 
pairs, so this directive has a closing tag after the map is displayed: 


"</pre></code>\n", 


The restriction map for each enzyme, by which I mean the simple list of locations for each 
enzyme in the sequence, is displayed by the following code: 
(my Sparamenzyme = param('enzyme')) =~ s/,/ /g; 
foreach my Senzyme (split(" ", Sparamenzyme)) { 

print "Locations for Senzyme: ", 

join(' ', Srestrict->get_enzyme map (Senzyme)), "\n"; 
} 


print "\n\n\n"; 

First the space- or comma-separated list of enzymes is collected from the parameter enzyme 
, and the commas, if present, are removed. The enzymes are then split (on whitespace) into 
a list, and for each such enzyme, the Restrictionmap method get_enzyme_map is called to 
display the list of locations. 


Finally, the graphic map is displayed. This is accomplished, as before, by a simple call to the 
Restrictionmap method get_graphic, which returns a simple text version of the graphic 
(because graphictype => 'text' was specified when Restrictionmap object $restrict was 
created). 

print Srestrict->get_graphic, 
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7.5 Exercises 


Exercise 7.1 


Use your web browser to examine the actual HTML code for some of your favorite 
web pages. For example, on my Netscape browser, once a page is displayed, I can 
select the menu options View and then Page Source to see the HTML code. 


Exercise 7.2 


Identify the web server that is running on your computer. What version of the 
software is installed? Find the documentation for that web server. Are there good 
books available for that web server; free online tutorials, newsgroups, or FAQs; or 
local experts? Locate and examine the various logs for your web server. Where are 
the configuration files? How do you stop the web server, change a configuration file, 
and start the web server again? Is it a good idea to make a copy of a working 
configuration file before you change it? 


Exercise 7.3 
Is your computer located behind a firewall or a proxy server? 
Exercise 7.4 
Get the URL: : URI module and read the documentation. 
Exercise 7.5 
Read the CGI module documentation. If available, look at a book about Perl and CGI. 
Exercise 7.6 


The sequence file uploaded by the CGI Perl script webrebasel is assumed to be in 
FASTA format. Rewrite the program so that it uses the SeqFileIO.pm module and 
reports a problem back to the web page if that module can't determine the sequence 
file format. 


Exercise 7.7 





Write a new version of the CGI script webrebasel that uses the RestrictDB.pm 
module instead of the Restrict.pm module; it will use a relational database store of 
Rebase information instead of a DBM hash-based store. Note that your DBMS will have 
to allow the web server to access its Rebase tables; this may occur as the user 
nobody or some other. Getting the permissions right is probably the hardest part of 
this exercise and is very much dependant on your operating system and web-server 
software. 


Exercise 7.8 


Write a web-based CGI Perl program that provides some information about your lab 
or an experiment that you've conducted. Include a means whereby users can send 
you email with questions. 
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Chapter 8. Perl and Graphics 


The importance of an attractive and useful graphical display of information is fundamental to 
scientific programming. Just imagine the difference between seeing a graph of this week's 
stock market activity, and seeing a list of several pages of numbers describing the same 
activity. The former can be surveyed in a glance; the latter requires considerable study. The 
same principle applies to the display of results in bioinformatics. Of course, sometimes you 
need the raw data; but very often what you need is the overview that a good graphic can 
provide. 


Although there are many programming systems to produce graphics, the most popular are 
those that can be displayed via web browsers. This chapter builds on what you learned in 
Chapter 7 by showing you how to add graphics to your display of scientific results on a web 
page. 


My vehicle of choice for displaying scientific data is the GD.pm module, which allows you to 
produce images in the standard graphics formats PNG (Portable Network Graphics) and JPEG 
(Joint Pictures Expert Group), among others. I'll also briefly mention the graphics packages 
ImageMagick and Gimp, two more full-featured graphics creation and manipulation packages. 


GD.pm is an interface to the gd graphics library written in the C programming language by 
Thomas Boutell. The gd graphics library is fairly basic; it's fast, it handles text and various 
kinds of lines and space filling by means of its various functions, and it is easy to program 
elements such as graphs. 


The main reason for its success, however, based on those important technical reasons, is 
that it makes it possible to produce nice-looking graphics from your web site on-the-fly, in 
immediate response to a request from a user submitting a form. 
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8.1 Computer Graphics 


Computer graphics programming is a specialty within computer science, and it has its own 
journals and conferences and laboratories. It encompasses just about everything you can see 
on a computer, from the fonts with which letters are drawn (both on the screen, and when 
computer typeset and printed like this book), to elaborate images, animation, and the special 
effects in movies. 


Computer graphics images are represented as data in the computer; they're usually 
organized in files, one per image, although they can also be stored in a computer program's 
memory, as I did when storing the graphics for the Restrictionmap.pm module in a scalar 
variable. Some kinds of animation store several images in one file, but I won't cover those 
kinds of image files here. 


For the purposes of this chapter, we won't be digging into the details of how graphics files are 
designed; that level of detail isn't possible in a single chapter. Here, I'll just discuss the very 
basic information you need to get started generating and displaying images for your web site. 
I'm going to show you some easy (and inexpensive) ways to generate graphics, enough to 
set your feet on the right path. The modules I use have methods that handle the details of 
reading and writing the different file formats, so you can simply use the methods and ignore 
the details of the formats themselves. 


8.1.1 Basic Graphics Concepts 


There are two things you need for computer graphics: a program to create a graphics file and 
the correct software and hardware to display the image. These days, digital photography may 
be the source of digital images, or a printed image may be entered into digital format by a 
scanning device. As computer technology has developed, many graphics display devices and 
file formats have seen some use and then virtually disappeared as they are supplanted by 
other systems.™ However, even though there are many file formats that are currently in use, 
a few standard formats are widely known and can be displayed by most graphics display 
programs. The PNG and JPEG files I use here are very commonly known. 


[4] One of my first programming jobs was to write vector graphics for a Tektronix oscilloscope display. Vector 
graphics are specified by endpoints of lines. Although such graphic programs are still in use, they are almost 
always converted to raster graphics for display these days. The same year, I wrote a large graphics program 
to write and play computer music on a Blit terminala raster graphics display that was the first to have 
modern-style simultaneously updated windows. Both systems have faded away, but the software they were 
written in was ported to other display devices. 


Although I won't go into any detail about graphics file formats, there are a few basic facts you 
need to understand because you'll be asked to specify some of these parameters when you 
use a graphics library (Such as GD.pm). For instance, in the Perl code that follows, you'll see a 
method colorAllocate that specifies and enters a color into a color table. The few short 
comments that follow in this section are meant to introduce these, and other common 
terms, and to give brief definitions. 


These days the standard graphics formats are mostly some variation of a raster image, in 
which a rectangular image is described by a two-dimensional matrix of pixel values. A pixel 
(short for "picture element") is one glowing dot on your computer screen. These values can 
be represented in different ways, from a simple 1 or 0 representing black or white (called a 
bitmap), to a number representing various shades of grey, to various color image schemes. A 
computer display may have different numbers of pixels overall; for instance, the one I'm 
working on now has 1024 horizontal and 768 vertical rows. Displays (or printers) are also 
rated by how closely packed the pixels are, usually by saying they have so many DPI or dots 
per inch. 


Colors are typically represented in one of three ways: 
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RGB 


Colors might be represented in the RGB fashion by three numbers indicating the values 
for the primary colors red, green, and blue. All colors can be formed by mixing these 
values in different quantities. A typical standard is 24-bit color, also called truecolor, in 
which each of the three primary colors is represented by an 8-bit value (256 possible 
values per primary color) for a total of over 16 million colors. This is well in excess of 
the 7 million colors most humans can distinguish, and as a result, a 24-bit color 
scheme is very popular, although many computer displays can handle only a smaller 
number of colors. 


Color table 


CMYK 


Colors are also sometimes represented by a color table (also known as a palette), 
which uses 8 bits per color at each pixel, and uses the values as an index into a 
256-slot table for a total of 256 colors. In this way, the image itself only needs to 
have 8 bits per color, instead of 24 bits, so the image can be much smaller as data. 
The drawback, of course, is that there are only 256 possible colors instead of 16 
million; also, an image needs a color table in addition to the pixel values. Still, this 
scheme works quite well and is frequently used. Many computers give you the option 
of displaying using a color table or using truecolor. 


The third popular way to represent colors is the CMYK color space, a subtractive color 
scheme used mainly in high-quality printing. The name CMYK comes from the colors 
cyan, magenta, yellow, and black, which can be combined to represent the range of 
colors, similar to the way that RGB does. I won't discuss this scheme further here. 


8.1.2 Graphics and File Formats 


I'm going to generate two kinds of graphics files in this chapter, PNG and JPEG. There are also 
many different file formats for computer graphics. Some of these different formats amount to 
little more than trivial variations in the header section of the files; others represent quite 
different ways of storing and displaying images on a computer screen. 


PNG 


JPEG 


Portable Network Graphic is a raster format that has become especially popular since 
the late 1990s when legal problems arose with the use of the very widely used GIF 
format. PNG can be used to create indexed color images (via the color table 
mentioned before) or to create 24-bit truecolor images. Because it does not perform 
compression on the image data, PNG is good for saving large images such as 
photographs because none of the photographic data is lost saving it in PNG format. 
But because images like photographs tend to be fairly large, and hence slow to display 
over the Web, PNG is best used for generating and displaying simpler graphics. 


Joint Photographic Experts Group is a popular format for displaying images, especially 
such large images as photographs. JPEG performs data compression, and although 
some of the picture quality is lost, the result is usually a good quality photographic 
image that is much smaller than the uncompressed original image. This has led to JPEG 
being widely adopted on the Web, where smaller picture sizes can dramatically 
improve the time it takes to download and display an image. (Displaying a JPEG image 
also requires decoding the compression, but that is a fairly fast operation on modern 
computers compared with the time it often takes to download a file.) However, the 
drawback to JPEG is exactly the same as its advantage: the compression. If you have 
a JPEG image, and want to do some editing of the digital image, you have to decode 
the image to perform the edits and then recompress it. But once an image is 
compressed, some of the original image data is lost. This compression/decompression 
will cause the loss of image data to become more apparent; several cycles of it can 
result in a noticeably poorer quality image. So it's best to save the original image in a 
noncompressed format, which can undergo editing without major loss of quality, and 
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then create JPEG images directly from an uncompressed version. 
GIF 


An older file format which was standard on the Web and is still widespread, GIF 
(Graphic Interchange Format), has fallen afoul of some proprietary licensing issues; 
most Perl programmers have shifted to the similar PNG format which is in the public 


domain. 
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8.2 GD 


GD is an excellent, relatively simple graphics toolkit for creating or modifying graphic images. 
You may create new images or read in images from several formats to edit or merge. GD 
contains a number of graphics primitives, objects such as lines and rectangles, which can be 
added to an image. It also includes support for TrueType fonts. 


8.2.1 Installing GD 


To install GD.pm on your computer, you should first test to see if the module is already there: 
try typing perldoc GD to see if the documentation that is installed with the module is running 
on your computer. If not, GD. pm is available from CPAN, which is the best way to get it (its 
real home is at http://stein.cshl.org). 





The GD documentation explains which supporting libraries are necessary. In particular, you 
need to have Thomas Boutell's gd C library installed, and in the proper version, to work with 
the version of GD.pm you are installing. To work with different graphics file formats, the gd 
library is best compiled in the presence of additional libraries that handle PNG, JPEG, zlib 
compression, and the FreeType version of TrueType fonts. How to obtain and install these 
additional packages is all explained in the GD documentation. You may need extra time to get 
these various pieces into place before you can get GD to do everything it can do. 


So, even if you use CPAN to install GD, the installation requires some information about those 
additional packages; it's a good idea to look at the documentation first, and to get those 
libraries in place, before installing GD. To install on my Linux system, I become root and issue 
the following commands: 

perl -MCPAN -e shell; 

cpan> install GD 


8.2.2 Using GD 


The following list is a partial overview of GD's capabilities: 


e To create a new image, you call the new constructor with x and y values of the size in 
pixels of the desired image. You can also choose between a palette or a truecolor 
image. And you can either create a new image or open an existing image in a few 
common formats (such as PNG, JPEG, GD, and XPM). 


e You can output images as PNG, JPEG, WBMP (a bitmap image format), or its own GD 
format and the compressed version GD2. 


e The color table can be manipulated in several ways. A new color can be added or an 
old one deleted. The closest match in the color table to specified red, green, and blue 
values can be returned, and the RGB values of an existing color table entry can be 
determined. One color can be designated as transparent so that portions of an image 
drawn in that color are invisible. 


e Lines can be drawn using defined "brushes," or in certain defined styles (like dotted 
lines, for example). Shapes may be filled with repeated copies of another image, or 
"tiled." 


e You can draw individual pixels, lines, dashed lines, rectangles, polygons, and arcs. You 
can fill regions of the image (such as the interiors of rectangles) in various ways. 


e You can copy regions of images, changing size if desired; you can also rotate and flip 
images. 
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e Text can be written in built-in or TrueType fonts, and rotated. Various methods return 
information about the fonts. 


e A Perl module called GD: :Graph adds some excellent graph drawing capabilities on top 
of the GD module. 


In sum, GD.pm provides easy-to-program and attractive graphics capabilities for generating 
dynamic web pages (and for other uses as well). Now that you've had an introduction to the 
topic of graphics programming, and to the GD.pm module, I'll use GD. pm in a Perl script that 
generates images dynamically to display restriction maps. 
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8.3 Adding GD Graphics to Restrictionmap.pm 


Now that you've seen an overview of the GD. pm Perl module, let's enhance the 
Restrictionmap.pm module that you've seen previously in Chapter 5, Chapter 6, and 
Chapter 7 so it can output PNG graphics files. 





Reviewing how the Restrictionmap.pm module works, you can see that its get_graphic 


method examines the graphictype attribute to determine what graphics drawing method to 


call. Following is the get_ graphic method again, along with the attributes the class defines. 
(Recall that this class inherits from the class Restriction.pm, so many things are defined 
there): 


# Class data and methods 


{ 
# A list of all attributes with default values. 


fo) 


my % attributes = ( 








# key = restriction enzyme name 

# value = space-separated string of recognition sites => regular 
expressions 

_rebase => { }, # A Rebase.pm object 

_ sequence => '', # DNA sequence data in raw format (only 
bases) 

_enzyme => '', # space separated string of one or more 
enzyme names 

_map => { }, # a hash: enzyme names => arrays of 
locations 


HE 


_graphictype => "text', one of 'text' or 'png' or some other 
_graphic => '', # a graphic display of the restriction map 
); 


# Return a list of all attributes 
sub _all attributes { 
keys % attributes; 


} 
} 


sub get graphic { 
my ($self) = @ ; 


# If the graphic is not stored, calculate and store it 
unless ($self->{_graphic}) { 


unless ($self->{_graphictype}) { 
croak ‘Attribute graphictype not set (default is "text")'; 
} 


# if graphictype is "xyz", method that makes the graphic is 
" drawmap xyz" 
my Sdrawmapfunctionname = " drawmap_" . Sself->{_graphictype}; 


# Calculate and store the graphic 
S$self->{_ graphic} = $self->$drawmapfunctionname; 


} 


# Return the stored graphic 
return $self->{_ graphic}; 


} 
As you recall, the get_ graphic method requires that some graphic type be defined in the 
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_graphictype attribute. It then tries to call a method whose name incorporates the name of 
the graphic type; for instance, if the graphic type is "png", the get_ graphic method tries to 
call a graphic drawing method called drawmap_png, 


How might you design such a get_graphic method? 
8.3.1 Designing Graphics 


There are several ways to construct a PNG graphic to draw a restriction map. You can extract 
the data necessary using drawmap text and its helper methods, 

_initialize annotation text, add_annotation_ text, and _formatmaptext. These 
methods use the methods of the Restriction class to get the locations of restriction sites 
for each enzyme and then make a text version of an annotated sequence string that shows 
the names of the restriction enzymes above their respective restriction sites in the sequence. 


Having found the locations of the restriction sites in the sequence, there are many ways you 
can visually represent the map. You can incorporate a design of double-stranded DNA, for 
instance, to serve as a background to the actual sequence you can draw on top of the double 
helical design. You can use colors and different sizes and choices of fonts to make the 
restriction sites stand out. You can draw the annotation in various ways, perhaps giving the 
restriction enzyme and the location in some visually appealing way near the restriction site. 
You may or may not want to draw the entire sequence; perhaps you could truncate long 
stretches of sequence that have no restriction sites, for instance, to make the relevant 
information (the restriction site locations) fit better into a smaller, more easily grasped space, 
ideally on one screen. 


The following _drawmap_png method uses the output of the drawmap_ text method and 
makes a PNG graphic out of it. 


# 
# Method to output graphics in PNG format 
# 


sub _drawmap_png { 
my($self) = @ ; 


# Get text version of graphic 
my @maptext = split( /\n+/, $self->_drawmap_text); 


# Now make a PNG graphic from the text version 


use GD; 


Layout information: fonts, margins, image size 


Use built-in GD fixed-width font 'gdMediumBoldFont' (could use TrueType 








# Font character size in pixels 
my (Sfontwidth, $fontheight) = (gdMediumBoldFont->width, 
gdMediumBoldFont->height) ; 





Margins top, bottom, right, left, and between lines 
my (Stmarg, Sbmarg, Srmarg, S$lmarg, Slinemarg) = (10, 10, 10, 10, 5); 





Image width is length of line times width of a character, plus margins 
my (Simagewidth) = (length(Smaptext[0]) * Sfontwidth) + Slmarg + Srmarg; 








Image height is height of font plus margin times number of lines, plus 
margins 
my (Simageheight) = ((Sfontheight + Slinemarg) * (Scalar @maptext)) + 
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Stmarg 
+ Sbmarg; 


my Simage = new GD::Image(Simagewidth, Simageheight) ; 
# First one becomes background color 

my Swhite = Simage->colorAllocate(255, 255, 255); 

my Sblack = Simage->colorAllocate(0, 0, 0); 

my Sred = Simage->colorAllocate(255, 0, 0); 


# Origin at upper left hand corner 





my ($x, Sy) = (S$lmarg, $tmarg); 
# 

# Draw the lines on the image 

# 


foreach my $line (@maptext) { 
chomp $line; 
# Draw annotation in red 
if(Sline =~ / /) { #annotation has spaces 
Simage->string (gdMediumBoldFont, $x, Sy, Sline, Sred); 
# Draw sequence in black 
}else{ #sequenc 
Simage->string(gdMediumBoldFont, $x, Sy, Sline, Sblack); 





} 
Sy += (Sfontheight + Slinemarg) ; 
} 


return Simage->png; 


} 


_drawmap_png has a fairly straightforward logic, given that I just drew the same text as is 
formatted in the drawmap_text method. So, it's no surprise that the first step of the 
_drawmap_png method is to get the formatted text. Recall that the lines of the text are joined 
into a string so the graphic can be stored in a scalar variable. So, the scalar string is split on 
newlines back into an array @maptext: 


# Get text version of graphic 
my @maptext = split( /\n+/, S$self->_drawmap_text); 


Now, the GD module is loaded: 


# Now make a PNG graphic from the text version 
use GD; 


Next, various layout dimensions are calculated: 
Layout information: fonts, margins, image size 


Use built-in GD fixed-width font 'gdMediumBoldFont' (could use TrueType 
fonts) 








Font character size in pixels 
my ($fontwidth, Sfontheight) = (gdMediumBoldFont->width, 
gdMediumBoldFont->height) ; 


# Margins top, bottom, right, left, and between lines 
my (S$tmarg, Sbmarg, Srmarg, S$lmarg, Slinemarg) = (10, 10, 10, 10, 5); 


# Image width is length of line times width of a character, plus margins 
my (Simagewidth) = (length(Smaptext[0]) * $fontwidth) + Slmarg + S$rmarg; 








# Image height is height of font plus margin times number of lines, plus 
margins 
my (Simageheight) = ((S$fontheight + Slinemarg) * (scalar @maptext)) + S$tmarg 
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+ Sbmarg; 


GD includes a handful of built-in, fixed-width fonts. It also permits the use of TrueType fonts; 
however, although they are freely available, high quality, very extensive, and probably 
available on your system, they are nevertheless not available everywhere. As a result, I've 
decided to use a built-in font that's always available with GD. 


However, as a result, the method has the font name hardcoded in a few places. (See the 
exercises at the end of the chapter.) For instance, GD built-in font names are objects with 
methods associated with them, in particular, methods that return the width and height of the 
font in pixels, which my _drawmap_png method saves in the variables $fontwidth and 
Sfontheight. 7 


Various margins are then set for the page as a whole and for the spacing between lines. 


Finally, the overall dimensions of the image are calculated: 


# Image width is length of line times width of a character, plus margins 
my (Simagewidth) = (length(Smaptext[0]) * $fontwidth) + Slmarg + S$rmarg; 





# Image height is height of font plus margin times number of lines, plus 
margins 

my (Simageheight) = ((S$fontheight + Slinemarg) * (scalar @maptext)) + Stmarg 
+ Sbmarg; 


The comments explain the logic. Recall that @maptext is the array that contains the text to be 
formatted. Smaptext [0] is the first line, which is either a full line and thus the maximum 
width, or it's less than a full line but is the only line, and thus the maximum width. (scalar 
@maptext) returns the number of lines in the @maptext array (because an array in scalar 
context returns the size of the array). 


After those preliminary calculations, the constructor for the GD class can be called to create a 
new image object Simage of the correct dimensions: 


my Simage = new GD::Image(Simagewidth, Simageheight) ; 
8.3.1.1 Applying color 


The colors I use can now be allocated for the image object. In GD, the first call to the 
colorAllocate method determines the background color for the entire image. The triples of 
numbers are the red, blue, and green values that are combined to form the various possible 
colors: 


# First one becomes background color 

my Swhite = Simage->colorAllocate(255, 255, 255); 
my Sblack = Simage->colorAllocate(0, 0, 0); 

my Sred = Simage->colorAllocate(255, 0, 0); 





Next, the x and y (horizontal and vertical) positions are initialized at the origin of the image to 
be drawn. In GD, the origin of the picture is 0,0 at the upper left corner. Increasing values of x 
move from left to right across the image; increasing values of y move from top to bottom 
down the image. Given that I want to have top and left margins, I set the image origin at the 
upper left corner, offset by the values of the margins: 


# Origin at upper left corner 
my ($x, $y) = ($lmarg, $tmarg); 


Now, everything is set up, and it's time to draw the image. The text to be drawn is processed 
line-by-line; after each line is drawn, the vertical position Sy is increased by the height of the 
font plus the interline margin. 


There are two kinds of lines: sequence and annotation. I assume that each annotation line 
contains some spaces (this won't always be true; see the exercises). Annotation lines are 
printed as strings in the color red, and sequence lines are printed in the color black. 

# 


# Draw the lines on the image 
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# 
foreach my $line (@maptext) { 
chomp $line; 
# Draw annotation in red 
if(Sline =~ / /) { #annotation has spaces 
Simage->string(gdMediumBoldFont, $x, Sy, Sline, S$red); 
# Draw sequence in black 
}else{ #sequenc 
Simage->string(gdMediumBoldFont, $x, Sy, Sline, Sblack); 








} 
Sy += (Sfontheight + Slinemarg); 
} 


8.3.1.2 Calling the method 


Finally, after all the text lines have been formatted and added to the image, the png GD 
method is called on the image. This produces a PNG image, which is returned from my 
_drawmap_png method: 


return Simage->png; 
Because that method was called by the method _getgraphic, the PNG graphic is both saved 


in the graphic attribute and sent off to the user's browser, which knows how to display a 
PNG image. 


In other circumstances you might want to save the image to a file, which you can do by 
opening a file, assigning it a filehandle (say IMG), and printing the $image->png output to the 
file: 


print IMG Simage->png; 
8.3.2 Adding JPEG Output to Restrictionmap.pm 


Let's consider how to add JPEG output to Restrictionmap.pm. I need a_drawmap_jpg 
method to handle the case when a JPEG image is requested by the programmer setting the 
_graphictype attribute to "jpg". 


Looking over the GD documentation and the _drawmap_png method I've just written, I see 
that the last line of my _drawmap_png method: 


return Simage->png; 

can be replaced by: 

return Simage->jpeg; 

in order to output a JPEG format. So, here's my new _drawmap_jpg method, almost an exact 
duplicate of drawmap_png (see the exercises at the end of the chapter). 


# 
# Method to output graphics in JPEG format 
# 





sub drawmap_jpg { 
my(S$self) = @ ; 


Get text version of graphic 
my @maptext = split( /\n+/, $self->_drawmap_text); 





# Now make a JPEG graphic from the text version 
use GD; 


mn 


# Layout information: fonts, margins, image size 
mr 


# Use built-in GD fixed-width font 'gdMediumBoldFont' (could use TrueType 
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fonts) 


# 
# Font character size in pixels 
my (Sfontwidth, S$fontheight) = (gdMediumBoldFont->width, 


gdMediumBoldFont->height) ; 


Margins top, bottom, right, left, and between lines 
my (Stmarg, Sbmarg, Srmarg, Slmarg, Slinemarg) = (10, 10, 10, 











margins 
my (Simageheight) = 


((S$fontheight + Slinemarg) * (scalar @maptext)) + Stmarg + 


Sbmarg; 
my Simage = new GD::Image(Simagewidth, Simageheight) ; 


# First one becomes background color 

y Swhite = Simage->colorAllocate(255, 255, 255); 
y Sblack = Simage->colorAllocate(0, 0, 0); 

y Sred = Simage->colorAllocate(255, 0, 0); 


B44 


# Origin at upper left hand corner 





my ($x, Sy) = (S$lmarg, $tmarg); 
# 

# Draw the lines on the image 

# 


foreach my $line (@maptext) { 
chomp $line; 
# Draw annotation in red 
if(Sline =~ / /) { #annotation has spaces 
Simage->string(gdMediumBoldFont, $x, Sy, Sline, Sred); 
# Draw sequence in black 
}else{ #sequenc 








Simage->string (gdMediumBoldFont, $x, Sy, Sline, Sblack); 


} 
Sy += (Sfontheight + Slinemarg) ; 
} 


return Simage->jpeg; 


} 


To use this new method from the webrebasel CGI Perl script, I'd have to change the 
requested graphictype attribute to jpeg; I'd also have to change the HTML header line that 


gets printed out just before printing the image data. Now, that line is set to: 
print "Content-type: image/png\n\n"; 

but I'd have to change that to: 

print "Content-type: image/jpeg\n\n"; 


You'll notice if you try that out, the JPEG graphic seems a bit blurry around the edges of the 
letters. This is a result of the compression algorithm that JPEGs use. They tend to work very 
well with photographic images, but the kind of text and line images that webrebasel creates 


do not fare as well under JPEG compression. 


4 PREVIOUS 


BRT 
fey] 


Image width is length of line times width of a character, plus margins 
my (Simagewidth) = (length(Smaptext[0]) * Sfontwidth) + Slmarg + Srmarg; 


Image height is height of font plus margin times number of lines, 
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8.4 Making Graphs 


Displaying data graphically (graphs, bar graphs, pie charts, etc.) is the most common goal of 
image programming You've just seen how the GD Perl module helps you create graphics. It's 
great for quickly providing images from CGI web programs. 


GD: :Graph, another Perl module that uses GD as its underlying mechanism, does an excellent 
job of rendering all sorts of graphs. Programming graphs with GD: : Graph is relatively easy, 
especially if you start with the example programs and modify them for your own purposes. 
The module is object oriented, and it provides many options for displaying the various graphs. 
Plan to take some time exploring the documentation and trying different options as you begin 
programming with GD: : Graph. Its flexible yet relatively simple programming interface has 
made it a popular and powerful module. GD: :Graph can be found on CPAN. It's simple to 
install once you have GD itself installed. 


For more in-depth information on graphics programming in Perl, see Perl Graphics 
Programming by Shawn Wallace (O'Reilly). The book includes a chapter on GD: : Graph that 
can serve as a fine tutorial. The documentation for GD: : Graph (on CPAN, or type perldoc 
GD: :Graph in a terminal window) is well-written and clearly organized. The software comes 
with several example programs that will help get you started. 


The following short program, gd1l.pl, uSeS GD: : Graph to create a simple bar graph. 
!/usr/bin/perl 


gdl.pl -- create GD::Graph bar graph 


use strict; 

use warnings; 

use Carp; 

use GD::Graph::bars; 





my Sdataset = ( 1 => 3, 
2=> 17, 
3 => 34, 
4 => 23, 
5 => 25, 
6 => 20, 
7 => 12, 
8 => 3, 
9 => 1 


)3 


# create new image 
my Sgraph = new GD::Graph::bars(600, 300); 


# discover maximum values of x and y for graph parameters 
my( S$xmax) = sort {$b <=> Sa} keys %dataset; 

my( Symax) = sort {Sb <=> Sa} values %dataset; 

# how many ticks to put on y axis 

my Syticks = int(Symax / 5) + 1; 


# define input arrays and enter 0 if undefined x value 
my(@xsizes) = (0 .. Sxmax); 

my(@ycounts) = (); 

foreach my $x (@xsizes) { 

if ( defined Sdataset{$x}) { 
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push @ycounts, Sdataset{$x}; 
jelse{ 
push @ycounts, 0; 
} 
} 


# set parameters for graph 
Sgraph->set ( 


transparent => 0, 

title => "Summary of mutation data", 
x label => 'Mutants per cluster', 
y_label => 'Number of mutants', 

x all ticks = >> hy 

y_all_ticks => 0, 

y_tick number => Syticks, 

zero axis => 0, 

zero axis only => 0, 


)e 
# plot the data on the graph 
my Sgd = Sgraph->plot ( 

[ ° xsizes, 


* ycounts 
); 


# output file 
my Spngfile = "gdgraphl.png"; 
unless(open(PNG, ">Spngfile")) { 

croak "Cannot open Spngfile:$!\n"; 


} 


# set output file handle PNG to binary stream 

# (this is important sometimes, for example doing 
# GCI programming on some operating systems 
binmode PNG; 


# print the image to the output file 

print PNG Sgd->png; 

Figure 8-1 shows the image that is written to the file gdgraphl.png. On my system it's very 
small for an image, only 2028 bytes. 


Figure 8-1. A simple bar graph 


Summary of mutation data 


Number of mutants 


0 1 2 3 4 5 6 7 8 5 
Mutants per cluster 


Here's how the code works. The new module is loaded like so: 


use GD::Graph::bars; 


The dataset is then defined as a hash. In reality, GD: :Graph expects to see the data in arrays. 


Page 283 


However, since your program may first collect data in a hash, I've shown you how to 
incorporate hash data into a GD: : Graph image and included the code that extracts the 
required x and y arrays from the hash. 


Next, the module's new constructor method is called to initialize a graph of type bars and 
sized at 600 pixels horizontal and 300 pixels vertical. 


# create new image 
my Sgraph = new GD::Graph::bars(600, 300); 


The next bit of code prepares some of the values needed to set parameters for the graph. 
The maximum x and y values are extracted from the hash sdataset. There is a way to 
specify how many ticks are desired along the y axis, but if you want one every five values, 
say, you have to figure out how many such values will just fit the data: hence the $yticks 
calculation. 


# discover maximum values of x and y for graph parameters 
my( $xmax) = sort {$b <=> Sa} keys %dataset; 
my( Symax) = sort {$b <=> Sa} values %dataset; 


# how many ticks to put on y axis 
my Syticks = int(Symax / 5) + 1; 


Next, the actual arrays of x and y values are constructed from the data in dataset: 


# define input arrays and enter 0 if undefined x value 
my(@xsizes) = (0 .. Sxmax); 
my(@ycounts) = (); 
foreach my $x (@xsizes) { 
if ( defined Sdataset{$x}) { 
push @ycounts, Sdataset{$x}; 
}else{ 
push @ycounts, 0; 
} 
} 


Finally the various parameters are set (there are quite a few possible parameters, as you'll 
see in the documentation), the data is plotted in the graph, an output file handle is created 
and set to binary mode (important for CGI web programming), and the image data is written 
out. 


Here is another version, gd2.p1, of the same program. This version creates "lines and points 
graphs that have the values written over each datapoint. Instead of showing the data as a 
bar, this graph shows data points joined by lines. The data points have the value written next 
to them, except for the value at x=3, which is suppressed. Other than that, it's the same 
image from the same data as used in gd1.pl. 


#!/usr/bin/perl 


gd2.pl -- create GD::Graph "linespoints" graph 


use strict; 

use warnings; 

use Carp; 

use GD::Graph::linespoints; 





my Sdataset = ( 1 => 3, 
2=> 17, 

3 => 34, 

4 => 23, 

5 => 25, 

6 => 20, 

7 => 12, 
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)3 


create new image 
y Sgraph = new GD::Graph::linespoints(600, 300); 


3 


discover maximum values of x and y for graph parameters 





my( S$xmax) = sort {$b <=> Sa} keys %dataset; 

my( Symax) = sort {Sb <=> Sa} values %dataset; 
how many ticks to put on y axis 

my Syticks = int(Symax / 5) + 1; 








define input arrays and enter 0 if undefined x value 
my(@xsizes) = (0 .. Sxmax); 
my(@ycounts) = (); 
foreach my $x (@xsizes) { 
if ( defined Sdataset{$x}) { 
push @ycounts, Sdataset{$x}; 
jelse{ 
push @ycounts, 0; 


} 


# define input data 
# need to do separately so show values will work 
my Sdata = GD::Graph::Data->new ( 


° . 
[ xSizes, 


* ycounts 
); 


# use show _values to show values of data next to points 
# use Svalues to suppress printing of value for x=3 

my Svalues = S$data->copy; 

Svalues->set_y(1, 3, undef); 


# set parameters for graph 
Sgraph->set ( 


transparent => 0, 

title => "Summary of mutation data", 
x label => 'Mutants per cluster', 
y_label => 'Number of mutants', 

x all ticks => 1, 

y_all_ticks => 0, 

y_tick number => Syticks, 

zero axis => 0, 

zero axis only => 0, 

show values => Svalues, 


); 
# plot the data on the graph 
my Sgd = Sgraph->plot ( 
[ *° xsizes, 
* ycounts 
\; 
# output file 


my Spngfile = "gdgraph2.png"; 
unless(open(PNG, ">Spngfile")) { 
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croak "Cannot open Spngfile:$!\n"; 


} 
binmode PNG; 


print PNG Sgd->png; 
Figure 8-2 shows the image that's written to the file gdgraph2.png. 


Figure 8-2. A lines and points graph 


Summary of mutation data 


Number of mutants 





Mutants per cluster 


A variety of image viewing programs may be used to display your results while you're doing 
graphics programming. ImageMagick has a display program called display that runs ona 
Linux system and elsewhere. I often use xv, and while it usually works quite well, it seems to 
have trouble with PNG files. When programming, it's a good idea to pick an imaging program 
that can be set to automatically update the display when you overwrite the image file with 
new data. It's common for a system's user interface to allow you to click on the image file's 
name or thumbnail image in order to launch an image viewing program with the image 
loaded. 


This section just gives you a taste of the capabilities of GD: : Graph. The module provides 
many possibilities, all of which are easily incorporated into your Perl programs for output to 
files or sent to a web browser from your CGI program. 
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8.5 Resources 


I recommend the book The Visual Display of Quantitative Information by Edward Tufte 
(Graphics Press). 


Although I don't demonstrate their use in this book, I want to mention two more full-featured 
graphics software packages that are free, and very useful if you'll be doing graphics to any 
extent: 


e ImageMagick and its Perl interface module Image: :Magick is a valuable tool, especially 
for creating images. Over 60 different file formats are supported, compared to the 
much more limited number in GD. A wide range of image-manipulation techniques are 
supported, enabling you to create animation, thumbnails, and apply many image 
processing functions, for example. 


e The Gimp (GNU Image Manipulation Program) is very much like the Photoshop 
program. It has an interface to Perl scripts that is built-in, and there are several 
modules, starting with the Gimp.pm module, that provide Perl programmers with 
access to its functionality. 


Between GD, Gimp, and ImageMagick, the Perl programmer has ready access to a host of 
image creation and manipulation software, at no cost. 
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8.6 Exercises 


Exercise 8.1 


Add an argument "linelength=>50" to Restrictionmap to set the length of lines in 
the graphic output. 


Exercise 8.2 


What happens if you request a graphic type for which no corresponding method has 
been defined? Would there be a better way to handle this situation than the current 
behavior of the module? 


Exercise 8.3 


If more than one enzyme is present, draw each enzyme in a different color. You'll 
have to solve the problem of how to define the required number of colors and how to 
choose them so that they'll be easily distinguishable from each other. You'll also have 
to rework the annotation lines when more than one type of enzyme appears on an 
annotation line, drawing the different sections individually with different colors. 


Exercise 8.4 


Add header information such as filename, count of restriction sites for each enzyme, 
and date, to the text or PNG or JPEG graphic generated by drawmap_text or 
_drawmap png or drawmap_ jpg. 


Exercise 8.5 


In drawmap_ text, add a number at the beginning of each line of sequence giving the 
position of the first base on that line. First figure out the number of digits needed for 
the last line, so you can make the appropriate amount of space before each line. Can 
you right-justify the numbers in the allotted space? 


Exercise 8.6 


In _ drawmap_ text, put a space between each group of 10 bases on a line to make it 
easier to find a particular base position. 


Exercise 8.7 


In drawmap_ png or drawmap_jpg, instead of putting the enzyme names in 
annotation lines above the sequence lines, try simply printing the sequence lines, but 
highlight each restriction site with color and enclosing it in a box. Try giving each type 
of enzyme in the map a different color and type of box and add a key that shows 
which enzyme matches up with each color/box. 


This works fine with a single enzyme. Does it work with two enzymes? With three or 
more? Do you need to change the output of drawmap_text to accomplish this? 


Exercise 8.8 


Rewrite the webrebasel CGI Perl script so that it is two programs: one that displays 
the opening screen form, and the other that calculates the PNG image. Say you call 
the image-generating CGI script rebase_png. Then use the CGI for an HTML tag such 
as: 





print img({ -srce => "/cgi-bin/rebase png?enzyme=EcoRI; fileseq= 
sampleecori.dna2" }); 





In this way, you can embed a dynamically generated image into a larger HTML 
document; it's an alternative to the method shown in the text in which the image is 
shown by itself in the web browser. 


Exercise 8.9 
The drawmap_jpg and _drawmap_png methods are almost identical. Add a third 
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method that does most of the work of these methods, taking as an argument the file 
type desired for output. Then rewrite drawmap_ jpg and drawmap_ png so that they're 
much shorter. Now add a_drawmap_wbmp method. 


Exercise 8.10 
The drawmap_png method assumes that annotation lines contain spaces while 
sequence lines do not. This is called an heuristic, a rule of thumb that is easy to 
compute and leads to a simplification of program logic. What if there is an annotation 
line that has no spaces because it annotates a line of sequence that is solid, 


wall-to-wall restriction sites? Invent another method of distinguishing between 
annotation and sequence lines. 


Exercise 8.11 


Use GD: :Graph to make a different representation of a restriction map. 
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Chapter 9. Introduction to Bioperl 


Bioperl is a collection of more than 500 Perl modules for bioinformatics that have been 
written and maintained by an international group of volunteers. Bioperl is free (under a very 
unrestrictive copyright), and its home is http://bioperl.org. 





One of the most difficult things about Bioperl is getting started using it. This is due to a 
scarcity of good documentation (which is being rectified) as well as the sheer size of the 
Bioperl module library. This chapter will help you get started using the Bioperl project 
software; it will guide you through the initial steps of getting the software, installing it, and 
exploring the tutorial and example material that it provides. After working through this 
chapter, you'll be well prepared to delve deeper into the riches of Bioperl, and, if you've also 
worked through the object-oriented chapters earlier in this book, you'll be in a good position 
to read the Bioperl code and contribute to the project yourself. 


The modules in Bioperl are written in the object-oriented style. Perl programmers who do not 
know object-oriented programming can still use the Bioperl modules with just a bit of extra 
information, as outlined in Chapter 3. 


The Bioperl modules cover various areas of bioinformatics, including some you've seen 
previously in this book. Although Bioperl includes some example programs, it is not meant to 
be a collection of complete user-ready programs. Rather, it's implemented as a toolkit you 
can dip into for help when writing your own programs. Its goal is to provide good working 
solutions to common bioinformatics tasks and to speed your program development. 


One of the best things about Bioperl is that it's an open source project, meaning that 
interested developers are invited to contribute by writing code or in other ways, and the code 
is available to anyone interested. If you've learned enough about Perl for bioinformatics to 
have worked through a good portion of this book, you'll find plenty of opportunity to get 
involved in Bioperl if you have the time and inclination. 


Team LiB 


Page 290 


[Team LiB J 


9.1 The Growth of Bioperl 


The practice of freely distributing Perl software for bioinformatics over the Internet began 
around 1992. Gradually, Perl became more and more popular for biology applications.” The 
release of Perl Version 5 and its support for object-oriented programming accelerated the 
development of reusable modules for biology at many research centers around the world. 
The large-scale genome sequencing efforts, then underway, provided much of the impetus, as 
well as the talent and funding, for these efforts. The Bioperl project, officially organized in 
1995, coalesced around one of these bioinformatics research groups that was doing a good 
job of organizing the collective volunteer effort such collaborative projects require. More 
information, including a short history and list of contributors, can be found at the Bioperl site 
http://bioperl.org. The Bioperl web site is frequently being updated and improved, and is the 
primary source of Bioperl code and documentation. 





a See, for example, the article "How Perl Saved the Human Genome Project" by Lincoln Stein at 
http://bioperl.org/GetStarted/tpj_Is_bio.html. 





Today, the Bioperl project has grown to a point where it is both useful enough, and well 
enough documented, that it is a must for Perl programming in bioinformatics. The 
documentation includes a program bptutorial.pl that comes with Bioperl, which explains 
and demonstrates several areas of the project (more on that later in this chapter). 
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9.2 Installing Bioperl 


Installing Bioperl's large collection of modules isn't too difficult. It usually goes fairly painlessly, 
even though there are a few extra installation steps due to the additional outside programs 
and Perl modules on which Bioperl depends. You will probably want to use Perl's CPAN to 
fetch and install Bioperl. 


INSTALL is a good document that covers Bioperl installation on Unix/Linux, Windows, and Mac 
operating systems. It's part of the Bioperl distribution and is located at 
http://bioperl.org/Core/Latest/INSTALL. This location may change, but it's easy to find from 
the Download link on the main Bioperl web page. If you're going to install Bioperl, I 
recommend you read this document first; here, I'll give an overview of what's available and 
add a few additional comments that may help with installation. 





Here's how to get Bioperl on different platforms: 


e On Unix/Linux, if you download the tar file from the web site, you merely need to 
untar it and go through the configure and make process by hand, as described in the 
INSTALL file that comes with the distribution. 


e If you have a Microsoft Windows machine with ActiveState's Perl ( 
http://www.activestate.com), there's a PPM file available for Bioperl; at the time of 
this writing, it's at http://bioperl.org/ftp/DIST/Bioperl-1.2.1.ppd. 








e There is also a CVS repository for Bioperl from which you can fetch the most current 
versions of the modules. But be careful: some newer versions of the modules are 
implementing new features and have more bugs than are found in some of the older, 
more stable releases. The details of how to install from CVS are also available at the 
Bioperl web page. 


All these methods for installing Bioperl are fine, but probably the most common way for Perl 
programmers to install sets of modules is by way of CPAN. I discussed CPAN in Chapter 1, 
but it's worth discussing again as it relates to Bioperl. 


To install Bioperl, you start by typing the following at the command line: 

perl -MCPAN -e shell; 

This gives the CPAN shell prompt: 

cpan> 

It's often the case that a module you want to install may require other modules for its proper 
operation, and perhaps one or more of these additional modules have not yet been installed. 


The CPAN system includes a way to check to see what other modules are required, and you 
can configure it to automatically follow and install missing prerequisites. 


Especially with a large collection of modules like Bioperl, you may expect to see such 
prerequisites crop up. When I asked my CPAN session how it was configured: 


cpan> o conf 

one of the lines of output from that query was: 

prerequisites policy ask 

I couldn't find a way to list the options within my CPAN session, so I took a look at the CPAN 


documentation in another window with perldoc CPAN, searched for the string 
prerequisites policy, and read the following: 
prerequisites policy 
what to do if you are missing module prerequisites 
('follow' automatically, 'ask' me, or ‘ignore') 
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To handle the prerequisites automatically, I configured my CPAN session like so: 


cpan> o conf prerequisites policy follow 


Now, if you have a slow Internet connection, setting the option this way can be a problem, 
because some prerequisites may take a long time to install, and you'll be stuck waiting for 
everything to finish. In that case, you may prefer to be asked about each prerequisite, so you 
can see if you have the needed time to do the installation. Plan on the process taking at least 
a couple of hours. My experience using both a dial-up modem and a cable modem from a 
home office is that the time involved was not too onerous. 


As you probably know by now, you can see a summary of CPAN session commands by 
typing help: 
cpan> help 


Display Information 












































command argument description 

a,b,d,m WORD or /REGEXP/ about authors, bundles, distributions, modules 
a. WORD or /REGEXP/ about anything of above 

r NONE reinstall recommendations 

ls AUTHOR about files in the author's directory 
Download, Test, Make, Install... 

get download 

make make (implies get) 

test MODULES, make test (implies make) 

anstall DISTS, BUNDLES make install (implies test) 

clean make clean 

look open subshell in these dists' directories 
readme display these dists' README files 
Other 

iy 2 display this menu ! perl-code eval a perl command 

o conf [opt] set and query options gq quit the cpan shell 
reload cpan load CPAN.pm again reload index load newer indices 
autobundle Snapshot force cmd unconditionally do cmd 
cpan> 


To find the Bioper! distribution, I typed: 
cpan> i /bioperl/ 
and got the following output: 


cpan> i /bioperl/ 
CPAN: Storable loaded ok 


























Going to read /root/.cpan/Metadata 
Database was generated on Sun, 27 Apr 2003 04:12:50 GMT 
Bundle Bundle::BioPerl (C/CR/CRAFFI/Bundle-BioPerl-2.04.tar.gz) 
Distribution B/BI/BIRNEY/bioperl-0.05.1.tar.gz 
Distribution B/BI/BIRNEY/bioperl-0.6.2.tar.gz 
Distribution B/BI/BIRNEY/bioperl-0.7.0.tar.gz 
Distribution B/BI/BIRNEY/bioperl-1.0.2.tar.gz 
Distribution B/BI/BIRNEY/bioperl-1.0.tar.gz 
Distribution B/BI/BIRNEY/bioperl-1.2.1.tar.gz 
Distribution B/BI/BIRNEY/bioperl-1.2.tar.gz 
Distribution B/BI/BIRNEY/bioperl-db-0.1.tar.gz 
Distribution B/BI/BIRNEY/bioperl-ext-0.6.tar.gz 
Distribution B/BI/BIRNEY/bioperl-gui-0.7.tar.gz 
Distribution C/CR/CRAFFI/Bundle-BioPerl-2.04.tar.gz 
Module Bio: :LiveSeq::10::BioPerl 

(B/BI/BIRNEY/bioperl-1.2.1.tar.gz) 











13 items found 


cpan> 
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From looking at the Bioperl web page, I knew that the Bundle: :BioPerl had extra useful 
modules in it that Bioperl uses. 





I started the installation by first installing the Bundle: 


cpan> install Bundle::BioPerl 


This process fetched the module code from a CPAN repository, unpacked it, tested it, and 
installed it, including its prerequisites (without questions asked because I set the follow 
option as described earlier in this section). 


Somewhat fortified by my success, I decided to go for broke and install the main Bioperl 
distribution. 


From the search output from my CPAN query, i /bioperl1/, just shown, I saw that the 
highest numbered release was 1.2.1. Just to be sure it was the latest release, I also spent 
some time at the Bioperl web site reading the news about the latest releases, so I was quite 
sure it was what I wanted. 


I also spent some time reading the INSTALL file, and I knew that I was generally in good 
shape. I had a new-enough version of Perl (Version 5.8.0) and was on one of the standard 
supported platforms. 


I installed it on a notebook computer with an Intel 686 processor that had the RedHat Linux 
7.2 operating system. The computer and the operating system were about two years old, old 
enough that I did do some looking around on the Bioperl web site to see if there were any 
advisories about which versions of Linux were recommended or advised against. 


In general, modern computer systems are complex, and they change rapidly. Operating 
systems and hardware have a replacement cycle of about two years. Some of this is planned 
obsolescence; some is legitimate technical progress. Whatever the cause, the result is that a 
system such as the one I'm describing needs several pieces to be in sync. The hardware, 
operating system software, Perl version, and Bioperl version have to coexist; other pieces 
such as the C compiler, web browser, web server, and so forth may also cause problems. 
Hence my interest in checking to see if there were any advisories on these topics. 


However, I didn't see any warnings about my particular platform. And since I had recently 
installed the latest version of Perl with success, I felt I had performed due diligence and that I 
should go ahead and try to install the Bioperl modules. 


To do so, I typed: 
cpan> install B/BI/BIRNEY/bioperl-1.2.1.tar.gz 














There followed a great amount of activity as CPAN got the distribution from the Internet and 
tested the various modules. After a time, the tests were complete; a few of them had failed, 
and CPAN decided not to install the modules. Since only a handful of module tests failed, I 
looked at the output on my screen and decided that the failures were only in peripheral parts 
of Bioperl I didn't have an immediate use for, and if I ever did need them, I could fix them 
later. 


So, to finish the installation, I forced CPAN to do the installation despite the presence of some 
test failures: 


cpan> force install B/BI/BIRNEY/bioperl-1.2.1.tar.gz 








This resulted in the modules and the documentation being installed. 
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9.3 Testing Bioperl 


To check that things were working okay with my new Bioperl installation, I first wrote a little 
test to see if a Perl program could find the Bio: : Perl module: 





use Bio::Perl; 


I ran it by putting it in a file bp0.p1 and giving it to the Perl interpreter: 


[tisdall]# perl bp0.pl 
[tisdall] # 


As you see, it didn't complain, which means Perl found the Bio: : Perl module. If it can't find 
it, it will complain. When I copied the bp0.p1 program to a file called bp0.p1.broken and 
changed the module call of Bio: :Perl to a call for the nonexistent module Bio::Perrl, I got 
the following (slightly truncated) error output: 

[tisdall@coltrane development]$ perl bp0.pl.broken 

Can't locate Bio/Perrl.pm in @INC 

BEGIN failed--compilation aborted at bp0.pl.broken line 1. 


9.3.1 Second Test 











Now I knew that Bio: : Perl could be found and loaded. I next tried a couple of test programs 
that are given in the bptutorial.pl document. I found a link to that tutorial document on 
the Web from the http://bioperl.org page. Alternately, I could have opened a window and 
typed at the command prompt: 





perldoc bptutorial.pl 


I went to the section near the beginning of the document called "I.2 Quick getting started 
scripts" and created a file tut1.p1 on my computer by pasting in the text of the first tutorial 
script: 





use Bio::Perl; 


# this script will only work with an internet connection 
# on the computer it is run on 
S$seq object = get_sequence('swissprot',"ROA1 HUMAN"); 





write sequence(">roal.fasta",'fasta',$seq_ object); 


I then ran the program and looked at the output file: 


[tisdall]$ perl tutl.pl 

[tisdall]$ cat roal.fasta 

>ROA1 HUMAN Heterogeneous nuclear ribonucleoprotein Al (Helix-destabilizing 
protein) 
(Single-strand binding protein) (hnRNP core protein Al). 
SKSES PKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGEVT 
YATVEEVDAAMNARPHKVDGRVVE PKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL 
RDY FEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL 
SKOEMASAS SSORGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG 
DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGOGYGNOGSGYGGSGS YDSYNNGGGRGFGGG 
SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGOY FAKPRNQGGYGGSS 
SSSSYGSGRRF 

[tisdall]$ 

























































































That seemed to work perfectly. 
9.3.2 Third Test 


I tried the next short script from the same section of the tutorial, pasting it into a file called 
tut2 ply 
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[tisdall]$ cat tut2.pl 
use Bio::Perl; 


# this script will only work with an internet connection 
# on the computer it is run on 





S$seq_ object = get_sequence ('swissprot',"ROA1 HUMAN") ; 





# uses the default database - nr in this case 
Sblast_result = blast sequence ($seq) ; 


write blast (">roal.blast", $blast_report); 
[tisdall]$ perl tut2.pl 


See See SS SS Se WARNING = S=-SS== SSS SSSoSeSSes= 

MSG: req was POST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi 
User-Agent: libwww-perl1/5.69 

Content-Length: 178 

Content-Type: application/x-www-form-urlencoded 





Submitted Blast for [blast-sequence-temp-id] 

[tisdall]$ cat roal.blast 

[tisdall]$ ls -l roal.blast 

-rw-rw-r-- 1. Eaedali.- tas dai 1. O Apr 30 11:28 roal.blast 
[tisdall]$ 


Here, I experienced a problem. Running the program created a page of error output, mostly 
formatted in HTML, which I've truncated in the output. When I tried to cat the output file, 
there was nothing in it, which I verified (on Linux) by examining the file with 1s -1, which 
showed that it had 0 bytes in it. 


This wasn't very encouraging; the second of the two short example scripts was failing. 


Looking at the two programs tutl.pl and tut2.pl1, you can see that the second is an 
extension of the first; after getting the sequence, the second program also tries to blast it, 
and this is where it is failing. It's running (printing those ... dots took a while because the 
program was waiting for a reply from NCBI over the Internet), but it's not printing out the 
BLAST report to the file roal.blast. What's going wrong? 


I started my investigation by looking at the documentation for the Bio: : Perl module by 
typing: 
perldoc Bio::Perl 
and searching for the function that is failing, blast_sequence. Here's what I found: 
blast_sequence 

Title : blast sequence 


Usage : Sblast_result = blast_sequence ($seq) 
Sblast_result = blast _sequence ('MFVEGGTFASEDDDSASAEDE' ) ; 























Function: If the computer has Internet accessibility, blasts 
the sequence using the NCBI BLAST server against nrdb. 


It choose the flavour of BLAST on the basis of the sequence. 


This function uses Bio::Tools::Run::RemoteBlast, which itself 


use Bio::SearchIO - as soon as you want to more, check out 
these modules 
Returns : Bio::Search::Result::GenericResult.pm 
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Args : Either a string of protein letters or nucleotides, or a 
Bio::Seq object 
That last sentence about the Args gave me a clue. I went back and looked at the failing 
program, which I placed in a file called tut2.pl, and, sure enough, the argument for the 
blast_sequence function, $seq, is neither a string of sequence nor a Bio: :Seq object. In 
fact, it's not even defined! Clearly, what the tutorial writer meant instead of $seq, was 
$seq_object, which is defined and populated by the earlier get_sequence method call. 


I was visited by a brainwave, and I decided to put a use strict and a use warnings into the 
program and to declare variables with my to see what ensued: 

[tisdall]$ cat tut2.pl 

use Bio::Perl; 

use strict; 

use warnings; 





# this script will only work with an internet connection 
# on the computer it is run on 





my $seq object = get_sequence('swissprot',"ROA1 HUMAN") ; 


# uses the default database - nr in this case 
my Sblast_result = blast _sequence($seq) ; 


write blast (">roal.blast", $blast_report); 

[tisdall]$ perl tut2.pl 

Global symbol "Sseq" requires explicit package name at tut2.pl line 11. 
Global symbol "Sblast_report" requires explicit package name at tut2.pl line 
lee ie 

Execution of tut2.pl aborted due to compilation errors. 

[tisdall1]$ 








Well, that's pretty clear. The variables $seq and $blast_report are wrong; apparently, the 
author intended to reuse the variables $seq object and $blast_ result instead. So, I edited 
the file and ran it again: 

[tisdall]$ cat tut2.pl 

use Bio::Perl; 

use strict; 

use warnings; 





# this script will only work with an internet connection 
# on the computer it is run on 


my $seq_ object = get_sequence('swissprot',"ROA1 HUMAN") ; 





# uses the default database - nr in this case 
my S$blast_result = blast _sequence($seq_ object) ; 





write blast (">roal.blast", $Sblast_ result); 

[tisdall]$ perl tut2.pl 

[tisdall]$ perl tut2.pl 

Submitted. Blast for [ROAL HUMAN) 46.3 sageaewiad satin 48 4 








[tisdall]$ ls -l roal.blast 
=rw=rw-r== 1. ‘tisdall. ~tusdali 56888 May 5 15:05 roal.blast 
[tisdall]$ 


Examining the roal.blast file convinced me that the program had successfully called blast. I 
decided that was a success, albeit a qualified one: this was a spot in the documentation that 
could use a little attention, clearly. By the time you're reading this, it may well have been 
fixed. 


9.3.3 Fourth Test 
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Now, let me show you one more test of my new Bioperl installation. 


Looking at the Bio: : Perl documentation, I found the following example and discussion at the 
very beginning. 





Bio::Perl (3) User Contributed Perl Documentation Bio::Perl (3) 





NAME 











Bio::Perl - Functional access to BioPerl for people who 
don't know objects 


SYNOPSIS 
use Bio::Perl; 


will guess file format from extension 
Sseq object = read_ sequence ($filename) ; 





forces genbank format 
Sseq object = read_sequence ($filename, 'genbank"') ; 





reads an array of sequences 
@seq object_array = read_all_ sequences ($filename, 'fasta'); 











sequences are Bio::Seq objects, so the following methods work 
for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm' 




















print "Sequence name is ",$seq object->display id,"\n"; 

print "Sequence acc is ",$seq object->accession_number,"\n"; 
pring. "Pirst 5 bases is ",Sseq object=>subseqg (1,5) ,"\n"s 

# get the whole sequence as a single string 

S$sequence as_a_string = $seq object->seq( ); 


# writing sequences 
write sequence (">$filename", 'genbank',$seq object); 
write sequence (">$filename", 'genbank',@seq object_array); 


# making a new sequence from just strings you have 
# from something else 





Ss q_obj ct = new_sequenc ("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA", 
"myname", "AL12232") ; 





# getting a sequence from a database (assumes internet connection) 





Sseq object = get_sequence('swissprot',"ROA1 HUMAN") ; 





Sseq object = get_sequence('embl',"AI129902"); 





Sseq_ object = get_sequence('genbank',"AI129902"); 





# BLAST a sequence (assummes an internet connection) 





Sblast_report = blast _sequence ($seq_ object); 
write blast (">blast.out", S$blast_report); 


DESCRIPTION 
Easy first time access to BioPerl via functions 
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ty 








if DBACK 
Mailing Lists 





User feedback is an integral part of the evolution of this 
and other Bioperl modules. Send your comments and sugges- 
tions preferably to one of the Bioperl mailing lists. 

Your participation is much appreciated. 





bioperl-l@bio.perl.org 


Reporting Bugs 





Report bugs to the Bioperl bug tracking system to help us 
keep track the bugs and their resolution. Bug reports can 
be submitted via email or the web: 





bioperl-bugs@bio.perl.org 
http://bugzilla.bioperl.org/ 

















AUTHOR - Ewan Birney 
Email bioperl-l@bio.perl.org 
Describe contact details here 
APPENDIX 
The rest of the documentation details each of the object 
methods. Internal methods are usually preceded with a _ 


I took the example code from the SYNOPSIS section and put it in a file called bp1.p1. I noticed 
that the code was not meant to be a running program. The variable $filename refers to 
some sequence file in the first line of code without being initialized, and then the same variable 
name is used in several other lines for different purposes. This is not an uncommon situation 
in such SYNOPSIS sections; the point is to show how individual calls can be made to methods 
in the class, not to demonstrate a complete working program (although sometimes the code 
can be run exactly as is). 


I had to make some changes to make this code a working, runnable program. I declared the 
strict and warnings pragmas and declared each variable with my. I created variables for the 
different sequence filenames I needed as both input and output, and I made sure that the 
input sequence files were on disk and of the correct variety (for example, I created the 
array.fasta file with three FASTA headers and sequences one after the other). In the end I 
had this code: 


use Bio::Perl; 
use strict; 
use warnings; 


my Sgbfilename = 'AI129902.genbank'; 


# will guess file format from extension 
my $seq_object0O = read_sequence ($gbfilename) ; 





# forces genbank format 
my $seq objectl = read_sequence ($gbfilename, 'genbank') ; 





my Sfastafilename = 'array.fasta'; 


# reads an array of sequences 
my @seq object_array = read_all_ sequences ($fastafilename, 'fasta'); 


# Sequences are Bio::Seq objects, so the following methods work 
# for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm' 
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print "Sequence name is ",$seq objectl->display id,"\n"; 
print "Sequence acc is ",$seq objectl->accession_ number,"\n"; 
print "First 5 bases is ",$seq objectl->subseq(1,5),"\n"; 





# get the whole sequence as a single string 
my $sequence as_a_string = $seq objectl->seq( ); 


# writing sequences 


my Sgbfilenameout = 'bpout.genbank'; 
write sequence (">$gbfilenameout", 'genbank',$seq_object1); 
write sequence (">$gbfilenameout", 'genbank',@seq_ object array); 


# making a new sequence from just strings you have 
# from something else 





my Ss q_ obj ct2 = n w_sequenc ("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA", 
"myname", "AL12232") ; 


# getting a sequence from a database (assumes internet connection) 





my $seq object3 = get_sequence('swissprot',"ROA1 HUMAN") ; 





my $seq object4 = get_sequence('embl',"AI129902"); 








my $seq_ object5 = get_sequence('genbank',"AI129902"); 





# BLAST a sequence (assummes an internet connection) 





my S$blast_report = blast_sequence($seq_object3); 


write blast (">blast.out",$blast_report); 


As you see, I didn't check on the output of all the method calls, but the real purpose of this 
test of my newly installed Bioperl modules was just to see if all the modules and methods 
could be found and would run when called. 


As mentioned previously, I also added use strict; and use warnings; and declared all the 
variables with my. Often, you don't see that usage in this kind of documentation; it's not 
required in Perl, and it may distract some readers of the documentation from the main point 
of showing how the methods are called. This is not an unusual omission to find in Perl class 
documentation, but of course, it's recommended and often very helpful, as you saw in the 
last section. 





So, I ran my slightly edited version of the example code from the beginning of the Bio::Perl 
manpage, with the following results: 


[tisdall]$ perl bpl.pl 

Sequence name is AT129902 

Sequence acc is ATI129902 

First 5 bases is CICceG 

Submitted Blast for [ROA1 HUMAN] ......... 
[tisdall]$ 


I looked at each of the input and output files and verified the expected contents. The following 
output from a listing of the files will give you an idea of what to expect (although you may get 
different results by running the program with different input files or database lookups): 


-rw-rw-r-- 1 tisdall tisdall 2391 May 5 10:37 AI129902.genbank 
-rw-rw-r-- 1 tisdall tisdall 2485 May 5 13:00 array.fasta 
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-rw-rw-r-- L ‘tasdall, tasdall 1513 May 5 13:02 bpl.pl 
-rw-rw-r-- 1 tisdall tisdall 3653 May 5 13:02 bpout.genbank 
-rw-Yrw-r-- 1 tisdall tisdall 56888 May 5 13:04 blast.out 











Notice that at the end of the program I call the blast sequence method on the protein 
sequence object $seq_object3. I discovered that trying to run the blast_sequence method 
on a nucleotide sequence object (such as $seq_object5) failed. Although the documentation 
for the method said that the sequence type would be examined and the appropriate BLAST 
program called (for example, blastp for protein sequence and blastx for nucleotide 
sequence, against the nr nonredundant protein database), it always seemed to call blastp no 
matter what the input sequence, and therefore it failed unless called with protein sequence as 
I had stored in $seq_object3. Perhaps this bug, a disconnect between the code and the 
documentation, has been fixed by the time you read this. 
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9.4 Bioperl Problems 


Bioperl is still a work in progress, and it has some problems. I'd like to mention the two main 
problems now. 


First, the Bioperl documentation is incomplete. In fact, until fairly recently, there was no 
document that provided a tutorial introduction to the project. This has changed; the 
bptutorial.pl document, which you've already seen and will see more of, is an excellent 
beginning, despite its occasional errors. This document cleverly combines a tutorial with quite 
a few example programs that you can run, as you'll soon see. 


Other documentation for Bioperl is also available, including Internet-based tutorials, 
forthcoming books, example programs, and journal articles. So, the situation has recently 
improved. 


Second, Bioperl is big (over 500 modules), written by volunteers, and gradually evolving. The 
size of the project is a sign that Bioperl addresses many interesting and useful problems, but 
it also means that, for the new user of Bioperl, an overview of the available resources is a 
task in itself. 


The majority of the Bioperl code is quite good, especially the most-used parts of it. However, 
the volunteer and evolving nature of Bioperl development means that some of the code is 
unfinished and not as well integrated with other parts of the project as one would like. Newer 
or less used modules may still need some shaking out by users in real-world situations. This is 
where you can make an initial contribution to the project: as you find problems, report them 
(more on that later). 


Many of the computing world's most successful programs are the result of the same kind of 
volunteer development as Bioperl (the Perl language itself and the Apache web server are two 
examples). Bioperl is well positioned to achieve a similarly central position in the field of 
bioinformatics. 


Team LiB 
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9.5 Overview of Objects 


Bioperl is a big project and a fairly large collection of modules. Some of these modules are 
standalone; others interact with each other in various ways. 


Your first task in learning about Bioperl is to get an idea of the main subject areas the 
modules are designed to address. So to begin with, here is a brief overview of the main types 
of objects in Bioperl, collected in a few broadly defined groups. 


Sequences 

Bio::Seq is the main sequence object in Bioperl. 

Bio::PrimarySeq iS a sequence object without features. 

Bio::SeqIO provides sequence file input and output. 

Bio::Tools::SeqStats provides statistics on a sequence. 
Bio::LiveSeq::* handles changing sequences. 
Bio::Seq::LargeSeq provides support for very large sequences. 

Databases 

Bio::DB::GenBank provides GenBank access. Similar modules are available for several 

piclooleal databases. 

Bio::Index::* indexing and accessing local databases. 

Bio::Tools::Run::StandAloneBlast runs BLAST on your local computer. 

Bio::Tools::Run::RemoteBlast runs BLAST remotely. 

Bio::Tools::BPlite parses BLAST reports. 

Bio::Tools::BPpsilite parses psiblast reports. 

Bio::Tools::HMMER::Results parses HMMER hidden Markov model results. 

Alignments 

Bio::SimpleAlign manipulates and displays simple multiple sequence alignments. 

Bio::UnivAln manipulates and displays multiple sequence alignments. 

Bio::LocatableSeq are sequence objects with start and end points for locating 

saan to other sequences or alignments. 

Bio::Tools::pSW aligns two sequences with the Smith-Waterman algorithm. 

ae ::Tools::BPbl2seq is a lightweight BLAST parser for pairwise sequence alignment 

using the BLAST algorithm. 

Bio::AlignIo also aligns two sequences with BLAST. 

Bio::Clustalw is an interface to the Clustalw multiple sequence alignment package. 

Bio::TCoffee is an interface to the TCoffee multiple sequence alignment package. 

Bio::Variation: :Allele handles sets of alleles. 

Bio::Variation: :SeqDiff handles sets of mutations and variants. 

Features and genes on sequences 

Bio::SeqFeature is the sequence feature object in Bioperl. 

Bio::Tools::RestrictionEnzyme locates restriction sites in sequence. 

Bio::Tools::Sigcleave finds amino acid cleavage sites. 

Bio::Tools: :OddCodes rewrites amino acid sequences in abbreviated codes for 

specific statistical analysis (e.g., a hydrophobic/hydrophilic two-letter alphabet). 

Bio::Tools::SeqPattern provides support for regular expression descriptions of 

sequence patterns. 

Bio::LocationI provides an interface to location information for a sequence. 

Bio::Location::Simple handles simple location information for a sequence, both as 

a single location and as a range. 

Bio::Location::Split provides location information where the location may 

encompass multiple ranges, and even multiple sequences. 

Bio::Location::Fuzzy provides location information that may be inexact. 

Bio::Tools::Genscan is an interface to the gene finding program. 

Bio::Tools::Sim4::Results (and Exon) is an interface to the gene exon finding 

program. 
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:Tools: 
:Toolks: 
2 LoGd is: 
sTools¢ 
sToois: 


:ESTScan is an interface to the gene finding program. 
:MZEF is an interface to the gene finding program. 
:Grail is an interface to the gene finding program. 
:Genemark is an interface to the gene finding program. 
:EPCR parses the output of ePCR program. 








4 PREVIOUS || NEXT > 
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9.6 bptutorial.pl 


I've already shown you a little of the bptutorial.pl document. I ran and discussed a few of 
the short example programs in the preceding sections. 


As you know, one of the easiest ways to get started with a programming system is to find 
some working and fairly generic programs in that system. You can read and run the 
programs, and then proceed to alter them using them as templates for your own 
programming development. 


Bioperl comes with a directory of example programs, but the best place to begin looking for 
starting-off program code is right in the bptutorial.p1l document itself. That .p1 suffix on 
the name is the giveaway; the document is actually itself a program, cleverly designed so that 
you can read, and run, example programs that exercise the core parts of the Bioperl project. 


The following explanation of the runnable programs that are part of bptutorial.pl appears 
at the end of the document (when you view it on the Web or as the output of perldoc 
bptutorial.pl). 


V.2 Appendix: Tutorial demo scripts 


The following scripts demonstrate many of the features of 
bioperl. To run all the core demos, run: 


> perl -w bptutorial.pl 0 
To run a subset of the scripts do 
> perl -w bptutorial.pl 
and use the displayed help screen. 
It may be best to start by just running one or two demos 
at a time. For example, to run the basic sequence manipu- 


lation demo, do: 


> perl -w bptutorial.pl 1 





Some of the later demos require that you have an internet 
connection and/or that you have an auxilliary bioperl 
library and/or external cpan module and/or external pro- 
gram installed. They may also fail if you are not running 
under Linux or Unix. In all of these cases, the script 
should fail "gracefully" simply saying the demo is being 
skipped. However if the script "crashes", simply run the 
other demos individually (and perhaps send an email to 
bioperl-l@bioperl.org detailing the problem :-). 











(Recall that the -w flag to Perl turns on warnings in almost the same manner as a use 
warnings; directive.) 


To test my Bioperl installation, I started by running the basic sequence manipulation demo as 
suggested. 


First, I thought I might copy the bptutorial.pl program file into my own working directory 
from the Bioperl distribution directory where I'd unpacked the source code. I wanted to put it 
in my own directory so as not to muddy up the Bioperl distribution directory with my own 
extraneous files. However, I discovered that the tutorial demo programs rely on a number of 
datafiles that are found in the t/data/ subdirectory of the Bioperl distribution. Running the 
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program in my own directory gave me an error because the program evidently requires a 
FASTA-formatted file called dnal. fa: 


[tisdall]$ perl -w bptutorial.pl 1 


Beginning sequence manipulations and SeqIO exampl 








SSS EXCRETION = =SsSSsssS Sse 

MSG: Could not open t/data/dnal.fa: No such file or directory 
STACK Bios :Root:s :1lOrk 1natialize. 16 

/usr/local/lib/perl5/site perl/5.8.0/Bio/Root/I0O. 

pm: 264 
STACK Bio::SeqI0O:: initialize 
/usr/local/lib/perl5/site perl/5.8.0/Bio/SegI0O.pm: 437 
STACK Biot: SeQ1lOtsiastas+ initialize 
/usr/local/lib/perl5/site perl/5.8.0/Bio/SeqI0O/ 
fasta.pm: 80 
STACK Bio::SeqIO::new /usr/local/lib/perl5/site_ perl/5.8.0/Bio/SeqIO.pm:355 
STACK Bio::SeqIO::new /usr/local/lib/perl5/site perl/5.8.0/Bio/SeqIO.pm: 368 
STACK main:: ANON _ bptutorial.pl1:2758 

STACK toplevel bptutorial.p1l:3933 









































[tisdall]$ 


I decided it would be easier to run bptutorial.pl from the distribution directory. I entered 
that directory; on my Linux machine, I unpacked the Bioperl source code into 
/usr/local/src/bioperl-1.2.1. I then tried to run the demo: 


[tisdall]$ cd /usr/local/src/bioperl-1.2.1 
[tisdall]$ perl -w bptutorial.pl 1 


Beginning sequence manipulations and SeqIO exampl 

First sequence in fasta format... 

>Testl 
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGAT TAAAAAAAGAGTGTC 
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG 
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC 
ACAACATCCATGAAACGCATTAGCACCACC 

Seq object display id is Testl 

Sequence is 





AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGAT TAAAAAAAGAGTGTCTGATAGCAGCTTCTGAA 


CTGGTTAC 


CTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACT TTAACCAATATAGGCATAGCGCACAGACAG 


ATAAAAATT 
ACAGAGTACACAACATCCATGAAACGCATTAGCACCACC 

Sequence from 5 to 10 is TTTCAT 

Acc num is unknown 

Moltype is dna 

Primary id is Testl 

Truncated Seq object sequence is TTTCAT 

Reverse complemented sequence 5 to 10 is ATGAAA 
Translated sequence 6 to 15 is LORAICLCVD 














Beginning 3-frame and alternate codon translation example... 
ctgagaaaataa translated using method defaults : LRK* 
ctgagaaaataa translated as a coding region (CDS): MRK 


Translating in all six frames: 
frame: 0 forward: LRK* 








frame: 0 reverse-complement: LFSQ 
frame: 1 forward: *EN 
frame: 1 reverse-complement: YFL 
frame: 2 forward: EKI 
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frame: 2 reverse-complement: IFS 
Translating with all codon tables using method defaults: 

















1 : LRK* 
2 L*K* 
S. <: TRK* 
4 : LRK* 
5 2. Sk 
6 : LRKQ 
9 2) LSN* 
10 * LRK* 
11 s LRK* 
L? + SRK* 
13 : LGK* 
LA. &. LSNY 
15: = LRK* 
16 ¢ LRK* 
21 : LSN* 
[tisdall1]$ 


That seemed to run without error. Now I wanted to see the Perl code that had run and 
produced that output. 


Exploring a bit, I found that the POD documentation part of bptutorial.pl is roughly the first 
half of the file, and that the second half of the file contains the Perl code for the demos. 


The author attributions and the libraries that are loaded are at the beginning of the Perl code 
section, somewhere near the middle of the bptutorial.pl file: 


























!/usr/bin/perl 
# PROGRAM : bptutorial.pl 
PURPOSE : Demonstrate various uses of the bioperl package 
AUTHOR : Peter Schattner schattner@alum.mit.edu 
# CREATED : Dec 15 2000 
REVISION : $Id: ch0O9.xml,v 1.2 2003/10/03 19:16:55 becki Exp chodacki $ 


use strict; 

use Bio::SimpleAlign; 
use Bio::AlignI0; 

use Bio::SeqI0O; 

use Bio::Seq; 





That seems like a good bet for a list of the most important Bioperl modules. Actually, some 
other modules are loaded for individual demos; at various points, the following come into 
play: 

use Bio::SearchI0O; 

use Bio::Root::I10; 

use Bio::MapI0O; 

use Bio::Treel0O; 

use Bio::Perl 


Following those use Module directives comes the following: 








subroutine references 


my (Saccess_ remote db, S$index local_db, $fetch_local_db, 
$sequence manipulations, $seqstats_and_seqwords, 
Srestriction_and_sigcleave, Sother seq utilities, $run_remoteblast, 
$run_standaloneblast, S$blast_parser, S$bplite parsing, $hmmer parsing, 
$run_clustalw_tcoffee, Srun_psw_bl2seq, S$simplealign, 
$gene prediction parsing, $sequence annotation, $largeseqs, 
S$run_tree, S$run_map, S$run_struct, $run_perl, $searchio_parsing, 
S$liveseqs, $demo_ variations, $demo_xml, S$display_ help, $bpinspectl ); 





# global variable file names. Edit these if you want to try 
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#out a tutorial script on a different file 


Biot tROOt?tTO=Scattrile("t", "data", "ecolitst. fa") ¢ 


S$dna_seq fil 


used in Sothe 
Samino_seq f 


used in Sblas 
Sblast_repor 





used in Sbpli 
Sbp_parse fi 


used in Sbpli 
Sbp_parse fi 








used in Sbpli 
Sbp_parse fi 


HU 


my Sunaligned_a 








# used in Ssimp 





e= 





ile = Bio::Roo 
t_parser 
GC Fale: = Bios 3 
te parsing 

lel = Bio::Roo 


te parsing 
le2 = Bio::Roo 





te parsing 
le3 = Bio::Roo 


mino file = 





lealign 











my Saligned_ami 





# other global 
my (@runlist, § 





no_ file = 


variables 
n ); 





used in $sequence manipulations 
Bio::Root::10->catfile("t","data","dnal.fa"); 


,210=>ca 


$71O=>ca 


2710=>ca 


sed in $run_clustalw_tcoffee 
Bio::Root::10->catfile(" 








tfile("t", "da 
tfile("t", "da 
efile ("t", "da 


= ad 


r seq utilities and in $run_perl and Ssequence_ annotation 
Eero =Scatinle ('r", "data", “Cysorotl. fa") 


ROOticbO=Scatiale ("C"; "data"; "blast... .eport™)¢ 


ta","blast.report"); 


ta", "bl2seq.cout™) 


1 
7 . 





data”, "evysprot la, ia™) 7 


Bios: ROOts ?LO=Scatfile("t", "data", "testaln.pram") ; 


A look at the documentation (perldoc Bio: :Root::10) shows that the catfile method 
returns the pathname of a file, which is being saved in a scalar variable such as 
$dna_seq_file. This method is there for portability between operating systems, because 
operating systems each have their own syntax for specifying pathnames. 


After that comes the code for the help screen, the output of which you've already seen; next 


comes the individual demo subroutines; and finally the code for the main subroutine, which 


you saw in the las 


t section. 


At the very end of the bptutorial.pl file, I found the part of the code that launches all the 


demos: 


## "main" progr 


am follows 
sy 


@ARGV; 








#no strict ‘ref 
@runlist = 
if (scalar ( 

argument 
if (S$runlis 
foreach $n 
if (Sn 
last; } 

ae CS 
ak Sa 
ae 1CSry 
ae Sn 
a CS ny 
ae 1CSr 
ae Sry 
tk (Sn 
if (Sn 

Le (Sm = 


@runlist)= =0 


) 


# display help if no option 


{&Sdisp 


lay_help; }; 


= 0 means run tests 1 thru 22 


t{0] = = 0) 
(@runlist) { 
= -=100) 


ll 
ll 
PoODTAUOBWNHEH 


{@runlist 


{my Sobject = 


= (1. 


see) ge NF 


Srunlist[1]; 


{&Sbplite parsing; next; } 
{&Shmmer parsing; next; } 
{&Ssimplealign ; 
) {&$gene prediction parsing; next; } 


next; } 


&Sbpinspectl1 (Sobject) ; 


) {&$sequence manipulations; next; } 

) {&$seqstats_and_seqwords; next; } 

) {&$restriction_and_sigcleave; next; } 
) {&$other seq utilities; next; } 
) {&$run_perl; next; } 

) {&$searchio parsing; next; } 
) 

) 

) 

0 


ta", "psiblastreport..out") 7 


Page 308 








if ($n = =11) {&Saccess_ remote db; next; } 

if ($n = =12) {&$index local_db; next; } 

if ($n = =13) {&$fetch_local_db; next; } 

if ($n = =14) {&$sequence annotation; next; } 
if (Sn = =15) {&Slargeseqs; next; } 

if (Sn = =16) {&Sliveseqs; next; } 

if ($n = =17) {éeSrun_struct; next; } 

if ($n = =18) {&$demo variations; next; } 

if ($n = =19) {&$demo_xml; next; } 

if ($n = =20) {&$run_tree; next; } 

if ($n = =21) {&$run_map; next; } 

if ($n = =22) {&$run_remoteblast; next; } 

if ($n = =23) {&$run_standaloneblast; next; } 
if ($n = =24) {&$run_clustalw_tcoffee; next; } 
if ($n = =25) {&$run_psw_bl2seq; next; } 





&$display help; 
} 


## End of "main" 





So, I searched for sequence manipulation and found the following code for this, the first of 
the bptutorial.pl demos: 


HERGE EEE EHH HEE EEE EE EE EEE EEE EE EE EE EE HE EEE EEE EE HH 
# sequence manipulations ( ): 
# 


$sequence manipulations = sub { 


my (Sinfile, Sin, Sout, Sseqobj); 
Sinfile = $dna_seq file; 


print "\nBeginning sequence manipulations and SeqIO example... \n"; 


# III.3.1 Transforming sequence files (SeqIO) 





Sin = Bio::SeqIO->new('-file' => Sinfile , 
'-format' => 'Fasta'); 
$seqobj = S$in->next_seq( ); 





# perl "tied filehandle" syntax is available to SeqI0O, 

# allowing you to use the standard <> and print operations 
# to read and write sequence objects, eg: 

#Sout = Bio::SeqIO->newFh('-format' => 'EMBL'); 











Sout = Bio::SeqIO->newFh('-format' => 'fasta'); 


print "First sequence in fasta format... \n"; 
print Sout S$seqobj; 





# III.4 Manipulating individual sequences 


# The following methods return strings 











print "Seq object display id is ", 

Ssegobj->display id( ), "\n"; # the human read-able id of the sequence 
print "Sequence is ", 

Sseqobj->seq( )," \n"; # String of sequence 

print "Sequence from 5 to 10 is ", 

Sseqobj->subseq(5,10)," \n"; # part of the sequence as a string 

print "Acc num is ", 

Ssegobj->accession number( ), " \n"; # when there, the accession number 
print "Moltype is ", 





Page 309 


Sseqobj->alphabet( ), " \n"; # one of 'dna','rna','protein' 





print "Primary id is ", Sseqobj->primary seq->primary id( )," \n"; 
a unique id for this sequence irregardless 
print "Primary id is ", $seqobj->primary id( ), " \n"; 


a unique id for this sequence irregardless 
of its display id or accession number 














The following methods return an array of Bio::SeqFeature objects 
Ssegobj->top SeqFeatures; # The 'top level' sequence features 
$segobj->all_ SeqFeatures; All sequence features, including sub 
seq features 











The following methods returns new sequence objects, 
































but do not transfer features across 
truncation from 5 to 10 as new object 
print "Truncated Seq object sequence is ", 
Sseqobj->trunc(5,10)->seq( ), ™ \n"; 
reverse complements sequenc 
print "Reverse complemented sequence 5 to 10 is ", 
Sseqobj->trunc(5,10)->revcom->seq, "  \n"; 
translation of the sequence 
print "Translated sequence 6 to 15 is ", 
Sseqobj->translate->subseq(6,15), " \n"; 
my Sc = shift; 
Sc ||= 'ctgagaaaataa'; 





print "\nBeginning 3-frame and alternate codon translation example... 
\n"; 


my Sseq = new Bio::PrimarySeq('-SEQ' => Sc, 

"-ID' => 'no.One'); 
print "Sc translated using method defaults a 
Sseq->translate->seq, "\n"; 





# Bio::Seq uses same sequence methods as PrimarySeq 

my Sseq2 = new Bio::Seq('-SEQ' => Sc, '-ID' => 'no.Two'); 
print "Sc translated as a coding region (CDS): ", 
Sseq2->translate(undef, undef, undef, undef, 1)->seq, "\n"; 











print "\nTranslating in all six frames:\n"; 


my @frames = (0, 1, 2); 

foreach my Sframe (@frames) { 
print " frame: ", Sframe, " forward: ", 
Sseq->translate(undef, undef, $frame)->seq, "\n"; 
print " frame: ", Sframe, " reverse-complement: ", 


Sseq->revcom->translate(undef, undef, S$frame)->seq, "\n"; 


print "Translating with all codon tables using method defaults:\n"; 
my @codontables = qw( 12 3 45 69 101112131415 16 21 ); 
foreach my $ct (@codontables) { 

prin’ Sct, "2a Wy 

Sseq->translate(undef, undef, undef, Sct)->seq, "\n"; 





return 15 


kG 
HHPEPEPEPEPRSRERERERET EEE ES ESSE EE TEE TE 
J PREVIOUS | 


4 
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9.7 bptutorial.pl: sequence_manipulation Demo 


In this section, I'll go through the code for the demo subroutine sequence manipulation that 


was shown in the last section. 
The subroutine is actually an anonymous subroutine; a reference to the subroutine is saved in 
the scalar reference variable $sequence_ manipulation: 


S$sequence manipulations = sub { 


} 


The first few lines of code declare some variables with my. Notice that these are not being 
passed in as arguments; this method uses no arguments but does occasionally use global 
variables such as $dna_seq file, which, as you've just seen, contain the pathname of the 
input sequence file the demo will use: 

my (Sinfile, Sin, Sout, $seqobj); 

Sinfile = $dna_seq file; 


print "\nBeginning sequence manipulations and SeqIO example... \n"; 





The code is cross-referenced to the tutorial sections of the file. The next comment line refers 
to the part of the document: 





# III.3.1 Transforming sequence files (SeqIO) 


which can be looked up in the table of contents to the document for further reading: 


TII.3 Manipulating sequences 
TIIT.3.1 Manipulating sequence data with Seq methods (Seq) 


Now, I'll take a look at the first section of example code in the sequence manipulations 
method: 


# III.3.1 Transforming sequence files (SeqIO) 





Sin = Bio::SeqIO->new('-file' => Sinfile ,'-format' => 'Fasta'); 
$seqobj = Sin->next_seq( ); 





# perl "tied filehandle" syntax is available to SeqIO, 

# allowing you to use the standard <> and print operations 
# to read and write sequence objects, eg: 

#Sout = Bio::SeqlO->newFh('-format' => 'EMBL'); 











Sout = Bio::SeqIO->newFh('-format' => 'fasta'); 


pint 
print 


"First sequence in fasta format... \n"; 
Sout Sseqobj; 








The code starts with a call to the new object constructor of the Bio: :SeqI0O class. The new 
method is being passed the pathname to a FASTA file in $infile, and told that the format is 
FASTA. 





A quick look at the Bio: :SeqIo documentation explains that the call to Bio: :SeqIO->new 
returns a stream object for the specified format. So, Sout is a stream object (a stream is 
input or output of data) for FASTA-formatted data, and Sin is a stream object for 
FASTA-formatted input from the file named in the $infile variable. These $in and Sout 
objects are also filehandles. 


After the Sin object is initialized on the FASTA file named in $infile, it calls the next_seq 
method, which gets the next (in this case, the first and perhaps only) FASTA record from the 
file, and it creates a sequence object $seqobj. The output Sout object is created. The Perl 


Page 312 


print statement is then called, using $out as a filehandle, and printing $seqob3. This prints 
the first FASTA record from that file, as shown in the demo output in the last section and 
repeated here: 


First sequence in fasta format... 

>Testl 
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGAT TAAAAAAAGAGTGTC 
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG 
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC 
ACAACATCCATGAAACGCATTAGCACCACC 


The next part of the sequence manipulation demo code shows many of the methods for 
extracting information from a sequence object: 


# III.4 Manipulating individual sequences 





# The following methods return strings 




















print "Seq object display id is ", 
Sseqobj->display id( ), "\n"; # the human read-able id of the sequence 
print "Sequence is ", 
Sseqobj->seq( )," \n"; # string of sequence 
print "Sequence from 5 to 10 is ", 
Sseqobj->subseq(5,10)," \n"; # part of the sequence as a string 
print "Acc num is ", 
$seqobj->accession number( ), " \n"; # when there, the accession number 
print "Moltype is ", 
Sseqobj->alphabet( ), " \n"; # one of 'dna','rna','protein' 
print "Primary id is ", $seqobj->primary_seq->primary id( )," \n"; 
a unique id for this sequence irregardless 
print "Primary id is ", $seqobj->primary id( ), " \n"; 
a unique id for this sequence irregardless 


of its display id or accession number 





# The following methods return an array of Bio::SeqFeature objects 
Sseqobj->top_SeqFeatures; The 'top level' sequence features 
Sseqobj->all_ SeqFeatures; All sequence features, including sub 
seq features 














The following methods returns new sequence objects, 
but do not transfer features across 

truncation from 5 to 10 as new object 

print "Truncated Seq object sequence is ", 
Ssegobj->trunc(5,10)->seq( ), " \n"; 

reverse complements sequenc 

print "Reverse complemented sequence 5 to 10 is ", 
Sseqobj->trunc(5,10)->revcom->seq, " \n"; 
translation of the sequence 

print "Translated sequence 6 to 15 is ", 
Sseqobj->translate->subseq(6,15), " \n"; 
































This section of the tutorial produced the following output (slightly reformatted to fit the pages 
of this book): 


Seq object display id is Testl 

Sequence is AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGA 
TAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTT TAACC 
AATATAGGCATAGCGCACAGACAGATAAAAAT TACAGAGTACACAACATCCATGAAACGCATTAGCACCACC 
Sequence from 5 to 10 is TTTCAT 

Acc num is unknown 

Moltype is dna 

Primary id is Testl 

Truncated Seq object sequence is TTTCAT 

Reverse complemented sequence 5 to 10 is ATGAAA 

Translated sequence 6 to 15 is LORAICLCVD 
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This part of the demo code is self-explanatory. As you can see, the methods return: 
e The FASTA ID (the part immediately following the > sign in the first line) 
e The sequence alone as a string (reformatted with line breaks to fit this book) 


e The accession number if it is known (FASTA format won't have this, but Genbank 
format will, for instance) 


e The type of molecule (dna, rna, or protein) 


e A unique ID compared to other IDs in the running program but drawn from the file itself 
if possible 


e The sequence of a new truncated sequence object defined from the original sequence 
object; a reverse complement 


e The peptide resulting from a translation of a specified part of a sequence 


The final section of the sequence_manipulation demo demonstrates the ability to translate 
nucleotides into all six reading frames using alternate codon translation tables if desired: 


my Sc = shift; 





Sc ||= 'ctgagaaaataa'; 

print "\nBeginning 3-frame and alternate codon translation example... \n"; 
my Sseq = new Bio::PrimarySeq('-SEQ' => Sc, '-ID' => 'no.One'); 

print "Sc translated using method defaults ae 


Sseq->translate->seq, "\n"; 





# Bio::Seq uses same sequence methods as PrimarySeq 

my Sseq2 = new Bio::Seq('-SEQ' => Sc, '-ID' => 'no.Two'); 
print "Sc translated as a coding region (CDS): ", 
Sseq2->translate(undef, undef, undef, undef, 1)->seq, "\n"; 











print "\nTranslating in all six frames:\n"; 


my @frames = (0, 1, 2); 

foreach my Sframe (@frames) { 
print " frame: ", Sframe, " forward: ", 
Sseq->translate(undef, undef, S$frame)->seq, "\n"; 
print " frame: ", Sframe, " reverse-complement: ", 


Sseq->revcom->translate(undef, undef, S$frame)->seq, "\n"; 


} 





print "Translating with all codon tables using method defaults:\n"; 
my @codontables = qw( 12 3 4 5 69 101112 1314 15 16 21 ); 
foreach my $ct (@codontables) { 

PELNE. Set) x 

Sseq->translate (undef, undef, undef, S$ct)->seq, "\n"; 


} 
This produces the output: 





Beginning 3-frame and alternate codon translation example... 
ctgagaaaataa translated using method defaults : LRK* 
ctgagaaaataa translated as a coding region (CDS): MRK 


Translating in all six frames: 
frame: 0 forward: LRK* 





frame: 0 reverse-complement: LFSQ 
frame: 1 forward: *EN 
frame: 1 reverse-complement: YFL 
frame: 2 forward: EKI 
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frame: 2 reverse-complement: IFS 
Translating with all codon tables using method defaults: 








1 : LRK* 
2 2 L*K* 
S. <: TRK* 
4 : LRK* 
5S 4. bSK* 
6 : LRKQ 
9 2) LSN* 
10 * LRK* 
11 s LRK* 
L? + SRK* 
13 : LGK* 
LA. &. LSNY 
15: = LRK* 
16 = LRK* 
21: LSN* 








Again, this code is fairly easy to follow, although you may find yourself turning to the 
documentation for some of the finer points when you try to use these objects and methods 
in your own code. 


This part of the code starts by getting an argument of sequence data if one is available to be 
shifted off of the @ argument array and into the variable $c; otherwise, it sets the variable $c 
to a preset sequence: 

my Sc = shift; 

Sc ||= 'ctgagaaaataa'; 

Briefly, the methods that follow in this part of the sequence_manipulation demo do the 
following: 


e Call the PrimarySeq: :new constructor to create a lightweight sequence object from 
Sc that doesn't have the extra annotation often present in a standard sequence file 
(see perldoc Bio::PrimarySeq for the whole story) 





e Translate the sequence (from $c) 


e Call the Bio: :Seq constructor to create the S$seq2 sequence object (from the same 
sequence data in $seq) 


e Translate the CDS in the sequence object $seq2 
e Translate the sequence in all six reading frames 
e Translate the sequence using all 21 defined codon tables 


The translate method seems to have lots of interesting options. However, if you try to look 
up the documentation for this method, you may have difficulty finding it. The problem is that 
Bioperl classes often make considerable use of inheritance. Say that the code is calling a 
method on a Bio:: PrimarySeq object, as do the following lines from the demo (somewhat 
separated in the original): 





my Sseq = new Bio::PrimarySeq('-SEQ' => Sc, '-ID' => 'no.One'); 
Sseq->translate(undef, undef, $frame)->seq, "\n"; 





You may try perldoc Bio::PrimarySeq and not find a discussion of the translate method 
because it is being inherited from some other class. And since multiple inheritance is possible, 
it can take considerable effort to track down this method documentation by figuring out the 
parent classes and searching the documentation. 


There's a very convenient way to find the documentation for a method, even if it's only 
inheritedit's built into the bptutorial.pl script. In the example just mentioned, you ask for 
the list of methods used by the Bio: :PrimarySeq method. 
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[tisdall]$ perl bptutorial.pl 100 Bio::PrimarySeq 





*x**xMethods for Object Bio::PrimarySeq ******xx 


Methods taken from package Bio::Describablel 
description display name 


Methods taken from package Bio::Identifiablel 


authority ISLC. SCELNG namespace namespace string object_id version 


Methods taken from package Bio::PrimarySeq 
accession direct seq set validate seq 





Methods taken from package Bio::PrimarySeqI 








accession number alphabet can_call_ new desc display id id 
Lis Ci neuler length moltype primary id revcom seq 
subseq translate trunc 


Methods taken from package Bio::Root::Root 
DESTROY debug verbose 





Methods taken from package Bio::Root::RootI 





carp confess deprecated new stack trace stack trace dump 
throw throw _ not implemented warn warn _not_implemented 
[tisdall]$ 





Sure enough, there under the heading "Methods taken from package Bio: :PrimarySeqI" 
appears the method translate. You should then look at the Bio: : PrimarySeqI 
documentation and find the following method description: 


translate 
Title : translate 
Usage : Sprotein_seq_obj = $dna_seq obj->translate 


#if full CDS expected: 

Sprotein_seq obj = 
Scds_seq_ obj->translate (undef, undef, undef,undef,1); 
Function: 








Provides the translation of the DNA sequence using full 
IUPAC ambiguities in DNA/RNA and amino acid codes. 














The full CDS translation is identical to EMBL/TREMBL 
database translation. Note that the trailing terminator 
character is removed before returning the translation 
object. 








Note: if you set $dna_seq_obj->verbose(1) you will get a 
warning if the first codon is not a valid initiator. 


Returns : A Bio::PrimarySeqI implementing object 

Args : character for terminator (optional) defaults to '*' 
character for unknown amino acid (optional) defaults to 'X' 
frame (optional) valid values 0, 1, 2, defaults to 0 
codon table id (optional) defaults to 1 
complete coding sequenc xpected, defaults to 0 (false) 
boolean, throw exception if not complete CDS (true) or 
defaults to warning (false) 








You will want to explore at least a few more of the tutorials embedded in bptutorial.pl, as 


we've done here with the first of the tutorials. 


NEXT > 
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9.8 Using Bioperl Modules 


I've reached my stated goal: to get you started using Bioperl. Needless to say, there is a 
great deal more to explore than will fit into the confines of this chapter. 


For those who wish to continue, here is a short list of some of the interesting and useful parts 
of Bioperl that will repay your efforts to learn them with considerably increased programming 
power: 

e Overview of Bioperl objects 

e Seq objects 

e Gbrowse and dff files 


e BLAST parsing 


e Automated database searching 
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Appendix A. Perl Summary 


This appendix summarizes those parts of the Perl programming language that will be most 
useful to you as you read this book. It is not a comprehensive summary of the Perl language. 
Remember that Perl is designed so that you don't need to know everything in order to use it. 
Source material for this appendix came from Programming Perl (O'Reilly). 
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A.1 Command Interpretation 


The Perl programs in this book start with a line something like: 

#!/usr/bin/perl 

On Unix (or Linux) systems, the first line of a file can include the name of a program and 
some flags, which are optional. The line must start with #!, followed by the full pathname of 
the program (in this case, the Perl interpreter), followed optionally by a single group of one or 
more flags. It's common in Perl programs to see the -w flag on this first command interpreter 
line, like so: 


#!/usr/bin/perl -w 
The -w flag turns on extra warnings. I prefer to do that with the line: 
use warnings; 


because it's more portable to different operating systems. 


If the Perl program file is called myprogram and has executable permissions, you can type 
myprogram (or possibly ./myprogram or the full or relative pathname for the program) to 
start the program running. 


The Unix operating system starts the program specified in the command interpretation line 
and gives it as input the rest of the file after the first line. So, in this case, it starts the Perl 
interpreter and gives it the program in the file to run. 


This is just a shortcut for typing the following at the command line: 


/usr/bin/perl myprogram 
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A.2 Comments 


A comment begins with a # sign and continues from there to the end of the same line. It is 


ignored by the Perl interpreter and is only there for programmers to read. A comment can 
include any text. 
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A.3 Scalar Values and Scalar Variables 
A scalar value is a single item of data, such as a string, a number, or a reference. 
A.3.1 Strings 


Strings are scalar values and are written as text enclosed within single quotes, like so: 
'This is a string in single quotes.' 

or double quotes, such as: 

"This is a string in double quotes." 

A single-quoted string prints out exactly as written. With double quotes, you can include a 


variable in the string, and its value will be inserted or "interpolated." You can also include 
commands such as \n to represent a newline (see Table A-3): 


Saside = '(or so they say)'; 

Sdeclaration = "Misery\n Saside \nloves company."; 
print S$declaration; 

This snippet prints out: 


Misery 
(or so they say) 
loves company. 


A.3.2 Numbers 


Numbers are scalar values that can be: 


e Integers: 
e 3 
e -4 

0 


e Floating-point (decimal): 
4.5326 


e Scientific (exponential) notation (3.13 x 1023 or 313000000000000000000000): 
313823 





e Hexadecimal (base 16): 
Ox12bc3 
e Octal (base 8): 
O5777 
e Binary (base 2): 
0b10101011 
Complex (or imaginary) numbers, such as 3 + i, and fractions (or ratios, or rational 
numbers), such as 1/3, can be a little tricky. Perl can handle fractions but converts them 


internally to floating-point numbers, which can make certain operations go wrong (Perl is not 
alone among computer languages in this regard): 


if ( 10/3 == 4 (1/3). * HO) 4 
print “Success !"; 
jelse { 


print "Failure!"; 


} 
This prints: 
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Failure! 


To properly handle rational arithmetic with fractions, complex numbers, or many other 
mathematical constructs, there are mathematics modules available, which aren't covered 
here. 


A.3.3 References 


Perl references are scalar values that contain a location in memory where some other Perl 
data can be found. Think of references as scalar values that point to other data, such as 
scalars like strings or numbers, or arrays, hashes, or subroutines. 


References are mainly used to: 

e Pass diverse types of arguments into a Perl subroutine 

e Avoid copying large amounts of data 

e Define the objects of object-oriented programming 

e Build complex data structures 
To create a reference, prepend a Perl value, variable, or subroutine with a backslash. 
To save a reference, assign it to a scalar variable. 


You can refer to a string by preceding it with a backslash: 


\'A referenced string' 


and you make a reference to a number similarly: 
\Soe 1415925 


These references may be saved in scalar variables, discussed next. 
A.3.4 Scalar Variables 


Scalar values can be stored in scalar variables. A scalar variable is indicated with a $ before 


the variable's name. The name begins with a letter or underscore and can have any number 


of letters, underscores, or digits. A digit, however, can't be the first character in a variable 
name. Here are some examples of legal names of scalar variables: 


$Var 

$Svar_1 

Here are some improper names for scalar variables: 
Slvar 


Svar!iable 


Names are case-sensitive: $dna is different from SDNA. 


These rules for making proper variable names (apart from the beginning $) also hold for the 


names of array and hash variables and for subroutine names. 


A scalar variable may hold any type of scalar value mentioned previously, such as strings or 


the different types of numbers. 
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A.4 Assignment 


Scalar variables are assigned scalar values with an assignment operator (the equals sign) in 
an assignment statement: 


Sthousand = 1000; 

assigns the integer 1000, a scalar value, to the scalar variable $thousanda. 

The assignment statement looks like an equal sign from elementary mathematics, but its 
meaning is different. The assignment statement is an instruction, not an assertion. It doesn't 
mean "Sthousand equals 1000." It means "store the scalar value 1000 into the scalar 


variable $thousand". However, after the statement, the value of the scalar variable 
S$thousand is, indeed, equal to 1000. 


References are usually saved in scalar variables. For example: 
Spi = \3.14159265; 


If you try to print $pi after this assignment, you get an indication that it's a reference to a 
scalar value at a memory location represented in hexadecimal digits. To print the value of a 
variable that's a reference to a scalar, precede its name with an additional dollar sign: 


print Spa, an" 
prin’ Sspiy- "wn"? 


This gives the output: 


SCALAR (0x811dlbc) 
32 LAS 9265 


You can assign values to several scalar variables by surrounding variables and values in 
parentheses and separating them by commas, thus making lists: 


(Sone, Stwo, Sthree) = ( 1, 2, 3); 


There are several assignment operators besides = that are shorthand for longer expressions. 
For instance, $a += $b is equivalent to $a = $a + Sb. Table A-1 is a complete list. 


Table A-1. Assignment operator shorthands 


$a x= $b (string $a repeated $b times) 





$a &= $b (bitwise AND) 
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Table A-1. Assignment operator shorthands 














$a >>= $b $a = $a >> $b ($a shift $b bits) 

$a <<= $b $a = $a >> $b ($a shift $b bits to left) 
$a .= $b (append string $b to $a) 
[ Team LiB | 
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A.5 Statements and Blocks 


Programs are composed of statements often grouped together into blocks. 
A statement ends with a semicolon (; ), which is optional for the last statement in a block. 


A block is one or more statements usually surrounded by curly braces: 


Sthousand = 1000; 
print S$thousand; 
} 


Blocks may stand by themselves but are often associated with such constructs as loops or if 
statements. 
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A.6 Arrays 


Arrays are ordered collections of zero or more scalar values, indexed by position. An array 
variable begins with the @ sign followed by a legal variable name. For instance, here are two 
possible array variable names: 

@arrayl 

@dna_fragments 


You can assign scalar values to an array by placing the scalar values in a list separated by 
commas and surrounded by a pair of parentheses. For instance, you can assign an array the 
empty list: 

@array = ( ); 


or one or more scalar values: 
@dna_fragments = ('ACGT', $fragment2, 'GGCGGA'); 


Notice that it's okay to specify a scalar variable such as $fragment2 in a list. Its current value, 
not the variable name, is placed into the array. 


The individual scalar values of an array (the elements) are indexed by their position in the 
array. The index numbers begin at 0. You can specify the individual elements of an array by 
preceding the array name by a $ and following it with the index number of the element within 
square brackets, like so: 


$dna_fragments[2] 


This equals the value of 'GGcGGa', given the values previously set for this array. Notice that 
the array has three scalar values indexed by numbers O, 1, and 2. The third and last element 
is indexed 2, one less than the total number of elements 3, because the first element is 
indexed number O. 


You can make a copy of an array using an assignment operator =, as in this example that 
makes a copy @output of an existing array @input: 


@output = @input; 


If you evaluate an array in scalar context, the value is the number of elements in the array. 
So if array @input has five elements, the following example assigns the value 5 to $count: 


Scount = @input; 


Figure A-1 shows an array @myarray with three elements, which demonstrates the ordered 
nature of an array by which each element appears and can be found by its position in the 
array. 


Figure A-1. Schematic of an array 
Arrays: 
@myarray= ("DNA;"RNA; Protein’); 
Positions: 0 1 2 


Scalarvalues: | ONA | WA | Protein | 


You can make a reference to an array by preceding it with a backslash; you dereference it by 
preceding the reference with an at sign @ for the entire array or with an extra dollar sign $ for 
an individual element of the array: 


@a = ( 'one', 'two', 'three'); 
Saref = ° a; 

print Saref, "\n"; 

print $Saref[0)], "\n"; 

print "@Saref", "\n"; 





gives the output: 
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ARRAY (0x811d1e0) 
one 
one two three 


You can also define an array as an anonymous array. An anonymous array isn't saved ina 
named array variable; it's a reference to array data and can only be saved in a reference. It's 
initialized within square brackets: 


Sanonarray = [ 'one', 'two', 'three']; 
print "@Sanonarray", "\n"; 


gives the output: 


one two three 
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A.7 Hashes 


A hash (also called an associative array) is a collection of zero or more pairs of scalar values, 
called keys and values. The values are indexed by the keys. An array variable begins with the 
% sign followed by a legal variable name. For instance, possible hash variable names are: 
shashl 

sgenes by name 


You can assign a value to a key with a simple assignment statement. For example, say you 
have a hash called sbaseball_stadiums and a key Phillies to which you want to assign 
the value Citizens Bank Park. This statement accomplishes the assignment: 


Sbaseball_ stadiums{'Phillies'} = 'Citizens State Bank'; 


Note that a single hash value is referenced by a $ instead of a % at the beginning of the hash 
name; this is similar to the way you reference individual array values using a $ instead of a @. 


You can assign several keys and values to a hash by placing their scalar values in a list 
separated by commas and surrounded by a pair of parentheses. Each successive pair of 
scalars becomes a key and a value in the hash. For instance, you can assign a hash the 
empty list: 
shash = (_ ); 
You can also assign one or more scalar key/value pairs: 
sgenes by name = ('genel', 'AACCCGGTTGGTT', 'gene2', 'CCTTTCGGAAGGTC') ; 
There is an another way to do the same thing, which makes the key/value pairs more readily 
apparent. This accomplishes the same thing as the preceding example: 
sgenes by name = ( 

"genel' => 'AACCCGGTTGGTT', 

"gene2' => 'CCTTTCGGAAGGTC'! 
); 
To get the value associated with a particular key, precede the hash name with a $ and follow 
it with a pair of curly braces containing the scalar value of the key: 
$genes by name{'genel'} 
This returns the value 'AACCCGGTTGGTT', given the value previously assigned to the key 'genel 
"in the hash sgenes_by name. Figure A-2 shows a hash with three keys. 


Figure A-2. Schematic of a hash 


Hashes: 
Name Assignment — List of key/value pairs 
Somyhash = = ('framel => ‘ACGTACGT, 
‘frame?’ => "CGTACGT! 
Frame3’=> “GTACGT’); 
Keys Values (unordered) 


CGTACGT 
ACGTACGT 


You can get an array of all the keys in a hash with the operator "keys", and you can get an 
array of all the values in a hash with the operator "values". 





The arrays you get won't be sorted. Here's an example: 


sh = ( ‘one! => 'for the money', 
"two! => 'for the show', 
"three! => 'to get ready' 
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); 

@keys = keys Sh; 

@values = values Sh; 
print "@keys\n@values\n"; 
gives the output: 


three one two 
to get ready for the money for the show. 
You can make a reference to a hash by preceding it with a backslash; you dereference it by 


preceding the reference with a percent sign % for the entire hash or with an extra dollar sign $ 
for an individual value of some hash key: 


sh = ( ‘one! => 'for the money', 
"two! => 'for the show', 
"three! => 'to get ready' 


\; 
shret = Yoon; 
print Shret, “\ni"; 
print $Shref{'two'}, "\n"; 
foreach Skey ( keys %Shref ) { 
print "key = Skey value = S$Shref{Skey}\n"; 





} 


gives the output: 


HASH (0x811d1le0) 

for the show 

key = three value = to get ready 
key = one value for the money 
key = two value = for the show 





You can also define a hash as an anonymous hash. An anonymous hash isn't saved in a 
named hash variable; it's a reference to hash data, and can only be saved in a reference. It is 
initialized within curly brackets: 
Sanonhash = { ‘one! => 'first', 

"two' => 'second', 

'three' => 'third' 

}? 
foreach $key ( keys SSanonhash ) { 
print "key = Skey value = S$Sanonhash{Skey}\n"; 

} 


gives the (unsorted) output: 


key = three value = third 

key = one value = first 

key = two value = second 
(reat 
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A.8 Complex Data Structures 


The standard Perl data types, scalar, array, and hash, can be combined into more complex 
data structures using references. The construction of complex data structures is summarized 
in Chapter 2. 


As an example, you can define a two-dimensional matrix as an array of anonymous arrays 
(recall that an anonymous array is a reference to array data): 


@microarray = ( 
[ 10, 2, 14 ], 
[ 15, 4, 54 ], 
[ Sly. 0, 99) ] 

); 
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A.9 Operators 


Operators are functions that represent basic operations on values: addition, subtraction, etc. 
They are frequently used and are core parts of the Perl programming language. They are 
really just functions that take arguments. For instance, + is the operator that adds two 
numbers, like so: 


B op ae 


Operators typically have one, two, or three operands; in the example just given, there are 
two operands 3 and 4. 


Operators can appear before, between, or after their operands. For example, the plus 
operator + appears between its operands. 
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A.10 Operator Precedence 


Operator precedence determines the order in which the operations are applied. For instance, 
in Perl, the expression: 


34+3%* 4 

isn't evaluated left to right, which calculates 3 + 3 equals 6, and 6 times 4 results in a value of 
24; the precedence rules cause the multiplication to be applied first, for a final result of 15. 
The precedence rules are available in the perlop manpage and in most Perl books. However, 


I recommend you use parentheses to make your code more readable and to avoid bugs. 
They make the following expressions unambiguous; the first: 


(3 +.-3) * 4 


evaluates to 24, and the second: 
3 + (3 * 4) 


evaluates to 15. 
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A.11 Basic Operators 


For more information on how operators work, consult the perlop documentation bundled 
with Perl. 


A.11.1 Arithmetic Operators 


Perl has the basic five arithmetic operators: 


+ 
Addition 
Subtraction 
*K 
Multiplication 
/ 
Division 
2K K 


Exponentiation 


These operators work on both integers and floating-point values (and if you're not careful, 
strings as well). 


Perl also has a modulus operator, which computes the remainder of two integers: 


fo) 


% modulus 


For example, 17 % 3 is 2, because 2 is left over when you divide 3 into 17. 


Perl also has autoincrement and autodecrement operators: 
++ add one 
-- subtract one 


Unlike the previous six operators, these change a variable's value. $x++ adds one to $x, 
changing 4 to 5 (or ato b). 


A.11.2 Bitwise Operators 
All scalars, whether numbers or strings, are represented as sequences of individual bits "under 


the hood." Every once in a while, you need to manipulate those bits, and Perl provides five 
operators to help: 


& 
Bitwise and 
Bitwise or 
Bitwise xor 

>> 
Right shift 

<< 
Left shift 
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A.11.3 String Operators 


Two strings may be concatenated, i.e., joined end to end, with the dot operator: 
‘This 16 a ' . 'jouned string" 


This results in the value 'This is a joined string'. 


A string may be repeated with the x operator: 
print "Hear ye! " x 3; 


This prints: 





Hear ye! Hear ye! Hear ye! 
A.11.4 File Test Operators 
File test operators are unary operators that test files for certain characteristics, such as -e 


$file, which returns true if the file $file exists. Table A-2 lists some available file test 
operators. 


Table A-2. File test operators 

















Operator 
-r File is readable 
-W File is writable 
-X File is executable 
-O File is owned by "you" 
-e File exists 
-Z File has zero size in bytes 
-S File has nonzero size (returns size in bytes) 
-f File is a plain file 
-d File is a directory (a.k.a. folder) 
-| File is a symbolic link 
-t Filehandle is opened to a terminal 
-T File is a text file 
-B File is a binary file 
-M Age of file (at startup of program) in days since modification 
-A Age of file (at startup of program) in days since last access 
=C Age of file (at startup of program) in days since last inode change 
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A.12 Conditionals and Logical Operators 
This section covers conditional statements and logical operators. 
A.12.1 true and false 


In a conditional test, an expression evaluates to true or false, and based on the result, a 
statement or block may or may not be executed. 


A scalar value can be true or false in a conditional. A string is false if it's the empty string 
(represented as "" or"). A string is true if it's not the empty string. 


Similarly, an array or a hash is false if empty and true if nonempty. 
A number is false if it's 0; a number is true if it's not 0. 


Most things you evaluate in Perl return some value (such as a number from an arithmetic 
expression or an array returned from a subroutine), so you can use most things in Perl in 
conditional tests. Sometimes you may get an undefined value, for example, if you try to add 
a number to a variable that has not been assigned a value. Things might then fail to work as 
expected. For instance, the following: 

use strict; 

use warnings; 

my $a; 

my $b; 

Sb = Sa + 2; 


produces the warning output: 
Use of uninitialized value in addition (+) at - line 5. 


You can test for defined and undefined values with the Perl function defined. 


A.12.2 Logical Operators 


There are four logical operators: 


not 
and 
Or 

xor 


not turns true values into false and false values into true. Its use is best illustrated in 
code: 


if(not Sdone) {...} 


This executes the code only if $done is false. 


and is a binary operator that returns true if both its operands are true. If one or both of the 
operands are false, the operator returns false: 


1 and 1 returns true 
"a" saad: “ot returns false 
‘rt and 0O returns false 


or is a binary operator that returns true if one or both of the operands are true. If both 
operands are false, it returns false: 


1 or 1 returns true 
Vat ooge "4 returns true 
'r sor 0 returns false 


xor, or exclusive-OR, returns true if one operand is true and the other operand is false; 
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xor returns false if both operands are true or if both operands are false: 


iL. on (0 returns true 
OQ - o% 1 returns true 
1 xor 1 returns false 
0 sor 0 returns false 


There are also variants on most of these: 
l Poe not 
&& for and 
|. i. SEGae 6s 


These have different precedence but otherwise behave the same. Some older versions of Perl 
may only have: 

! 

| | 

&& 


instead of not or and. 


A.12.3 Using Logical Operators for Control Flow 


A quick and popular way to take an action, depending on the results of a previous action, is 
to chain the statements together with logical operators. For instance, it's common in Perl 
programs to see the following statement to open a file: 


open(FH, $filename) or die "Cannot open file $filename: $!"; 


The use of or in this statement shows another important thing about the binary logical 
operators: they evaluate their arguments left to right. In this case, if the open succeeds, the 
or operator never bothers to check the value of the second operand (die, which exits the 
program with the message in the string, plus additional messages if $! is included). The or 
never bothers, because if one operand is true, the or is true, so it doesn't need to check the 
second operand. However, if the open fails, the or needs to check that the second operand 

is true or false, so it goes ahead and executes the die statement. 


You can use the and statement similarly to test the second operand only if the first operand 
succeeds. 


xor doesn't work for control flow, because both its arguments have to be evaluated each 
time. 


I haven't used this chaining of logical operators much; instead, I've used if statements. This 
is because I often find that I want to add more statements following a test, and it's easier if 
the original is written as an if statement with a block, and it's harder if the original is written 
as a logical operator. 


A.12.4 The if Statement 


Conditional tests are commonly found in if statements and in their variants and loops. Here's 
an example of an if statement: 
if (open (FH, $filename) { 


print "Hurray, I opened the file."; 
} 


The if statement is followed by a conditional expression enclosed in parentheses, which is 
followed by a block enclosed in curly braces. When the conditional expression evaluates as 
true, the statements in the block are executed. 


The if statement may optionally be followed by an else, which is executed when the 
conditional evaluates to false: 


if ( open(FH, $filename) { 
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print "Hurray, 
p-else. 


I opened the file."; 





print "Rats. The file did not open."; 


} 


The if statement may also optionally include any number of elsif clauses, which check 
additional conditional statements if none of the preceding conditional statements are true: 


I opened file 1."; 


Sfile2) { 
I opened file 2."; 
Sfile3) { 


I opened file 3."; 


if ( open(FH, S$filel) { 
print "Hurray, 

} elsif ( open(FH, 
print "Hurray, 

} elsif ( open(FH, 
print "Hurray, 

} else { 
print "None of 











} 


the dadblasted files would open."; 


In the preceding example, if file 1 opened successfully, the if statement doesn't try to open 


additional files. 


There is also an unless statement, which is the same as an if statement with the conditional 
negated. So these two statements are equivalent: 


unless ( open(FH, 


Sfilename) { 


print "Rats. The file did not open."; 


} 


if ( not open(FH, 


Sfilename) { 


print "Rats. The file did not open."; 


wa 
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A.13 Binding Operators 


Binding operators are used for pattern matching, substitution, and transliteration on strings. 
They are used with regular expressions that specify the patterns: 


‘ACGTACGTACGTACGT’ =~ /CTA/ 


The pattern is the string CTA, enclosed by forward slashes. The string binding operator is =~; it 
tells the program which string to search, returning true if the pattern appears in the string. 


!~ is another string binding operator; it returns true if the pattern isn't in the string: 
"ACGTACGTACGTACGT!'! !~ /CTA/ 

This is equivalent to: 

not 'ACGTACGTACGTACGT' =~ /CTA/ 

You can substitute one pattern for another using the string binding operator. In the next 


example, s/thine/nine/ is the substitution command, which substitutes the first occurrence 
of thine with the string nine: 


Spoor richard = 'A stitch in time saves thine.'; 
Spoor richard =~ s/thine/nine/; 
print Spoor richard; 


This produces the output: 


A stitch in time saves nine. 





Finally, the transliteration (or translate) operator tr substitutes characters in a string. It has 


several uses, but the two uses I've covered are first, to change bases to their complements 
A-=T,C—™G,G=—-C, andT A: 


SDNA = 'ACGTTTAA'; 

SDNA =~ tr/ACGT/TGCA/; 

This produces the value: 

TGCAAATT 

Second, the tr operator counts the number of a particular character in a string, as in this 
example which counts the number of Gs in a string of DNA sequence data: 


SDNA = 'ACGTTTAA'; 
$count = (SDNA =~ tr/A//); 
print $count; 


This produces the value 3; it shows that a pattern match can return a count of the number of 
translations made in a string, which is then assigned to the variable $count. 
[ Team LiB ] 
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A.14 Loops 


Loops repeatedly execute the statements in a block until a conditional test changes value. 
There are several forms of loops in Perl: 


while (CONDITION) {BLOCK} 

until (CONDITION) {BLOCK} 

for (INITIALIZATION ; CONDITION ; RE-INITIALIZATION ) {BLOCK} 
foreach VAR (LIST) {BLOCK} 

for VAR (LIST) {BLOCK} 

do {BLOCK} while (CONDITION) 

do {BLOCK} until (CONDITION) 











The while loop first tests if the conditional is true; if so, it executes the block and then 
returns to the conditional to repeat the process. If false, it does nothing, and the loop is 
over: 
oh = Sy 
while ( Si) { 

print. “Sans 

Sioa; 


} 


This produces the output: 


3 
2 
il. 


Here's how the loop works. The scalar variable $i is first initialized to 3 (this isn't part of the 
loop). The loop is then entered, and $i is tested to see if it has a true (nonzero) value. It 
does, so the number 3 is printed, and the decrement operator is applied to $i, which reduces 
its value to 2. The block is now over, and the loop starts again with the conditional test. It 
succeeds with the true value 2, which is printed and decremented. The loop restarts with a 
test of $i, which is now the true value 1; 1 is printed and decremented to 0. The loop starts 
again; 0 is tested to see if it's true, and it's not, so the loop is now finished. 


Loops often follow the same pattern, in which a variable is set, and a loop is called, which 
tests the variable's value and then executes a block, which includes changing the value of the 
variable. 


The for loop makes this easy by including the variable initialization and the variable change in 
the loop statement. The following is exactly equivalent to the preceding example and 
produces the same output: 
for (Si = 3.7% Si. @ Si=—= ) 4 

pramt: "Ss. ia"? 
} 
The foreach loop is a convenient way to iterate through the elements in an array. Here's an 
example: 


@array = ('one', ‘'two', 'three'); 
foreach Selement (@array) { 
print Selement\n"; 
} 
This prints the output: 


one 
two 
three 
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The foreach loop specifies a scalar variable $element to be set to each element of the array. 
(You may use any variable name or none, in which case the special variable $__ is used 
automatically.) The array to be iterated over is then placed in parentheses, followed by the 
block. You can use for instead of foreach as the name of this loop, with identical behavior. 


The first time through the loop, the value of the first element of the array is assigned to the 
foreach variable $element. On each succeeding pass through the loop, the value of the next 
element of the array is assigned to the foreach variable $element. The loop exits after it has 
reached the end of the array. 


There is one important point to make, however. If, in the block, you change the value of the 
loop variable $element, the array is changed, and the change stays in effect after you've left 
the foreach loop. For example the following: 


@array = ('one', ‘'two', 'three'); 


foreach Selement (@array) { 
Selement = 'four'; 


} 


foreach Selement (@array) { 
print Selement,"\n"; 

} 

produces the output: 


four 

four 

four 

In the do-until loop, the block is executed before the conditional test, and the test succeeds 
until the condition is true: 


print S$i,"\n"'y 
Si=S7 
bambi Sa 33 
This prints: 
3 


In the do-while loop, the block is executed before the conditional test, and the test succeeds 
while the condition is true: 


print. $2, \n"'; 
Si==; 
} while ( $i ); 
This prints: 


3 
2 


(ee 


4 PREVIOUS 
NEXT > 


4 
fey 


BP 
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A.15 Input/Output 


This section covers getting information into programs and receiving data back from them. 
A.15.1 Input from Files 


Perl has several convenient ways to get information into a program. The non-CGI programs 
in this book usually get input by opening and reading files. I've emphasized this way of getting 
input because it behaves very much the same way on any computer you may be using. 
You've observed the open and close system calls and how to associate a filehandle with a file 
when you open it, which then is used to read in the data. As an example: 

open (FILEHANDLE, “informationfile"); 

@data from informationfile = <FILEHANDLE>; 

close (FILEHANDLE) ; 






































This code opens the file informationfile and associates the filehandle FILEHANDLE with it. 
The filehandle is then used within angle brackets to actually read in the contents of the file and 
store the contents in the array @data_from_informationfile. Finally, the file is closed by 
referring once again to the opened filehandle. 


A.15.2 Input from STDIN 


Perl allows you to read in any input that is automatically sent to your program via standard 
input (STDIN). STDIN is a filehandle that by default is always open. Your program may be 
expecting some input that way. For instance, on a Mac, you can drag and drop a file icon 
onto the Perl applet for your program to make the file's contents appear in STDIN. On Unix 
systems, you can pipe the output of some other program into the STDIN of your program 
with shell commands such as: 


someprog | my perl program 


You can also pipe the contents of a file into your program with: 


cat file | my perl program 


or with: 


my perl program < file. 


Your program can then read in the data (from program or file) that comes as STDIN just as if 
it came from a file that you've opened: 


@data_from_stdin = <STDIN>; 


A.15.3 Input from Files Named on the Command Line 


You can name your input files on the command line. <> is shorthand for <ARGV>. The ARGV 
filehandle treats the array @ARGV as a list of filenames and returns the contents of all those 
files, one line at a time. Perl places all command-line arguments into the array @ARGV. Some 
of these may be special flags, which should be read and removed from @ARGv if there will also 
be data files named. Perl assumes that anything in @ARGV refers to an input filename when it 
reaches a < > command. The contents of the file or files are then available to the program 
using the angle brackets without a filehandle, like so: 


@data_from_ files = <>; 
For example, on Microsoft, Unix, or on the Mac OS X, you specify input files at the command 
line, like so: 


fo) 


% my program filel file2 file3 


A.15.4 Output Commands 
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The print statement is the most common way to output data from a Perl program. The 
print statement takes as arguments a list of scalars separated by commas. An array can be 
an argument, in which case, the elements of the array are all printed one after the other: 


@array = ('DNA', 'RNA', 'Protein'); 

print @array; 

This prints: 

DNARNAProtein 

If you want to put spaces between the elements of an array, place it between double quotes 
in the print statement, like this: 

@array = ('DNA', 'RNA', 'Protein'); 

print "@array"; 

This prints: 

DNA RNA Protein 

The print statement can specify a filehandle as an optional indirect object between the print 
statement and the arguments, like so: 

print FH "@array"; 

The printf function gives more control over the formatting of the output of numbers. For 
instance, you can specify field widths; the precision, or number of places after the decimal 


point; and whether the value is right- or left-justified in the field. Check the Perl 
documentation that comes with your copy of Perl for all the details. 


The sprintf function is related to the printf function; it formats a string instead of printing it 
out. 





The format and write commands are a way to format a multiline output, as when generating 
reports. format can be a useful command, but in practice it isn't used much. The full details 
are available in your Perl documentation, and O'Reilly's Programming Perl contains an entire 
chapter on format. 


A.15.4.1 Output to STDOUT, STDERR, and files 


Standard output, with the filehandle STDOUT, is the default destination for output from a Perl 
program, so it doesn't have to be named. The following two statements are equivalent unless 
you used select to change the default output filehandle: 


"Hello biology world!\n"; 
STDOUT "Hello biology world! \n"; 


print 
print 
Note that the STDOUT isn't followed by a comma. STDOUT is usually directed to the 


computer screen, but it may be redirected at the command line to other programs or files. 
This Unix command pipes the STDOUT of my_program to the STDIN of your_program: 





my program | your program 
This Unix command directs the output of my program to the file outputfile: 
my program > outputfile 


It's also common to direct certain error messages to the predefined standard error filehandle 
STDERR or to a file you've opened for input and named with a particular filehandle. Here are 
examples of these two tasks: 

print STDERR "If you reached this part of the program, something is terribly 
wrong!"; 


open(OUTPUTFD, ">output_file"); 
print OUTPUTFD "Here is the first line in the output file output_file\n"; 





STDERR is also usually directed to the computer screen by default, but it can be directed into 
a file from the command line. This is done differently for different systems, for example, as 
follows (on Unix with the sh or bash shells): 
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myprogram 2>myprogram.error 


You can also direct STDERR to a file from within your Perl program by including code such as 
the following before the first output to STDERR. This is the most portable way to redirect 


STDERR: 


open 


(STDERR, 





"Smyprogram.error") 


or die "Cannot open error file 


myprogram.error:$!\n"; 


The problem with this is that the original STDERR is lost. This method, taken from 
Programming Perl, saves and restores the original STDERR: 















































open ERRORFILE, ">myprogram.error" 
or die "Can't open myprogram.error"; 
open SAVEERR, ">&STDERR"; 
open STDERR, ">&ERRORFILE; 
print STDERR "This will appear in error file myprogram.error\n"; 
# now, restore STDERR 
close STDERR; 
open STDERR, ">&SAVEERR"; 
print STDERR "This will appear on the computer screen\n"; 








There are a lot of details concerning filehandles not covered in this book, and redirecting one 
of the predefined filehandles such as STDERR can cause problems, especially as your 
programs get bigger and rely more on modules and libraries of subroutines. One safe way is 
to define a new filehandle associated with an error file and to send all your error messages to 








MeMyProgramserron") 
t open myprogram.error:$!\n"; 














ES "This is an error message\n"; 





Note that the die function, and the closely related warn function, print their error messages 


it: 

open (ERRORMESSAGES, 
or die "Canno 

print ERRORMESSAG 

to STDERR. 
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A.16 Regular Expressions 


Regular expressions are, in effect, an extra language that lives inside the Perl language. In Perl, 
they have quite a lot of features. First, I'll summarize how regular expressions work in Perl; 
then, I'll present some of their many features. 


A.16.1 Overview 


Regular expressions describe patterns in strings. The pattern described by a single regular 
expression may match many different strings. 


Regular expressions are used in pattern matching, that is, when you look to see if a certain 
pattern exists in a string. They can also change strings, as with the s/// operator that 
substitutes the pattern, if found, for a replacement. Additionally, they are used in the tr 
function that can transliterate several characters into replacement characters throughout a 
string. Regular expressions are case-sensitive, unless explicitly told otherwise. 


The simplest pattern match is a string that matches itself. For instance, to see if the pattern ' 
abc' appears in the string 'abcdefghijklmnopqrstuvwxyz', write the following in Perl: 


Salphabet = 'abcdefghijklmnopqrstuvwxyz'; 
if( Salphabet =~ /abc/ ) { 
print $&; 


} 


The =~ operator binds a pattern match to a string. /abc/ is the pattern abc, enclosed in 
forward slashes to indicate that it's a regular-expression pattern. $& is set to the matched 
pattern, if any. In this case, the match succeeds, since 'abc' appears in the string Salphabet, 
and the code just given prints out abc. 


Regular expressions are made from two kinds of characters. Many characters, such as ‘a' or'' 
z', match themselves. Metacharacters have a special meaning in the regular-expression 
language. For instance, parentheses are used to group other characters and don't match 
themselves. If you want to match a metacharacter such as ( in a string, you have to precede 
it with the backslash metacharacter \ ( in the pattern. 


There are three basic ideas behind regular expressions. The first is concatenation: two items 
next to each other in a regular-expression pattern (that's the string between the forward 
slashes in the examples) must match two items next to each other in the string being 
matched (the Salphabet in the examples). So, to match 'abc' followed by 'def', concatenate 
them in the regular expression: 





Salphabet = 'abcdefghijklmnopqrstuvwxyz'; 
if( Salphabet =~ /abcdef/ ) { 
print S&; 
} 
This prints: 


abcdef 


The second major idea is alternation. Items separated by the | metacharacter match any one 
of the items. For example, the following: 


Salphabet = 'abcdefghijklmnopqrstuvwxyz'; 
if( Salphabet =~ /a(b|cld)c/ ) { 
print $&; 
} 
prints as: 
abc. 
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The example also shows how parentheses group things in a regular expression. The 
parentheses are metacharacters that aren't matched in the string; rather, they group the 
alternation, given as b|c|d, meaning any one of b, c, or d at that position in the pattern. 
Since b is actually in $alphabet at that position, the alternation, and indeed the entire pattern 
a(b|c|d)c, matches in the $alphabet. (One additional point: ab|cd means (ab) | (cd), not 
a(b|c)d.) 


The third major idea of regular expressions is repetition (or closure). This is indicated in a 
pattern with the quantifier metacharacter *, sometimes called the Kleene star after one of the 
inventors of regular expressions. When * appears after an item, it means that the item may 
appear O, 1, or any number of times at that place in the string. So, for example, all of the 
following pattern matches will succeed: 





"AC' =~ /AB*C/; 
"ABC! =~ /AB*C/; 
'ABBBBBBBBBBBC! =~ /AB*C/; 























A.16.2 Metacharacters 
The following are metacharacters: 
MG ah. Ae ae AP a 
A.16.2.1 Escaping with \ 


A backslash \ before a metacharacter causes it to match itself; for instance, \ matches a 
single \ in the string. 


A.16.2.2 Alternation with | 

The pipe | indicates alternation, as described previously. 
A.16.2.3 Grouping with ( ) 

The parentheses ( ) provide grouping, as described previously. 
A.16.2.4 Character classes 


Square brackets [ ] specify a character class. A character class matches one character, 
which can be any character specified. For instance, [abc] matches either a, or b, or c at that 
position (so it's the same as a|b/c). A -Z is a range that matches any uppercase letter, a-z 
matches any lowercase letter, and 0-9 matches any digit. For instance, [A-Za-z0-9] 
matches any single letter or digit at that position. If the first character in a character class is *, 
any character except those specified match; for instance, [*0-9] matches any character that 
isn't a digit. 


A.16.2.5 Matching any character with a dot 


The period or dot . represents any character except a newline. (The pattern modifier /s 
makes it also match a newline.) So, . is like a character class that specifies every character. 


A.16.2.6 Beginning and end of strings with * and $ 


The * metacharacter doesn't match a character; rather, it asserts that the item that follows 
must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a 
character but asserts that the item that precedes it must be at the end of the string (or 
before the final newline). For example: /*Watson and Crick/ matches if the string starts 
with Watson and Crick; and /Watson and Crick$/ matches if the string ends with Watson 
and Crick Or Watson and Crick\n. 


A.16.2.7 Quantifiers 


These metacharacters indicate the repetition of an item. The * metacharacter indicates zero, 
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one, or more of the preceding item. The + metacharacter indicates one or more of the 
preceding item. The brace { } metacharacters let you specify exactly the number of previous 
items, or a range. For instance, {3} means exactly three of the preceding item; {3,7} means 
three, four, five, six, or seven of the preceding item; and {3,} means three or more of the 
preceding item. The ? matches none or one of the preceding item. 


A.16.2.8 Making quantifiers match minimally with ? 


The quantifiers just shown are greedy (or maximal) by default, meaning that they match as 
many items as possible. Sometimes, you want a minimal match that will match as few items 
as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? tries to 
match as few as possible, perhaps even none, of the preceding item before it tries to match 
one or more of the preceding item. Here's a maximal match: 





"hear ye hear ye hear ye!' =~ /hear.*ye/; 
print $&; 


This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', and 
prints: 


hear ye hear ye hear y 





Here is a minimal match: 


"hear ye hear ye hear ye' =~ /hear.*?ye/; 
print S$; 





This matches 'hear' followed by .*? (the fewest number of characters possible), followed by ' 
ye', and prints: 


hear ye 


A.16.3 Capturing Matched Patterns 


You can place parentheses around parts of the pattern for which you want to know the 
matched string. Take, for example, the following: 


Salphabet = 'abcdefghijklmnopqrstuvwxyz'; 
Salphabet =~ /k(lmnop)q/; 

print $15 

This prints: 

lmnop 


You can place as many pairs of parentheses in a regular expression as you like; Perl 
automatically stores their matched substrings in special variables named $1, $2, and so on. 
The matches are numbered in order of the left-to-right appearance of their opening 
parenthesis. 


Here's a more intricate example of capturing parts of a matched pattern in a string: 





Salphabet = 'abcdefghijklmnopqrstuvwxyz'; 
Salphabet =~ /(((a)b)c)/; 

print "First pattern = ", $1,"\n"; 

print "Second pattern = ", $2,"\n"; 

print "Thard pattern = ", $3,"\n"; 

This prints: 

First pattern = abc 

Second pattern = ab 


Third pattern = a 


A.16.4 Metasymbols 


Metasymbols are sequences of two or more characters consisting of backslashes before 
normal characters. These metasymbols have special meanings in Perl regular expressions 
(and in double-quoted strings for most of them). There are quite a few of them, but that's 


Page 349 





because they're so useful. Table A-3 lists most of these metasymbols. The column "Atomic" 
indicates Yes if the metasymbol matches an item, No if the metasymbol just makes an 
assertion, and - if it takes some other action. 


Table A-3. Alphanumeric metasymbols 




















Symbol Atomic Meaning 
\0 Yes Match the null character (ASCII NULL) 
\NNN Yes Match the character given in octal, up to 377 
\n Yes Match nth previously captured string (decimal) 
\a Yes Match the alarm character (BEL) 
\A No True at the beginning of a string 
\b Yes Match the backspace character (BS) 
\b No True at word boundary 
\B No True when not at word boundary 
\cX Yes Match the control character Control-X 
\d Yes Match any digit character 
\D Yes Match any nondigit character 
\e Yes Match the escape character (ASCII ESC, not backslash) 
\E - End case (\L, \U) or metaquote (\Q) translation 
\f Yes Match the formfeed character (FF) 
\G No True at end-of-match position of prior m//g 
\l - Lowercase the next character only 
\L - Lowercase till \E 
\n Yes Match the newline character (usually NL, but CR on Macs) 
\Q - Quote (do-meta) metacharacters till \E 
\r Yes Match the return character (usually CR, but NL on Macs) 
\s Yes Match any whitespace character 
\S Yes Match any nonwhitespace character 








Page 350 


Symbol 
\t 
\u 
\U 
\w 
\W 
\x{abcd} 
\Z 


\Z 





Table A-3. Alphanumeric metasymbols 


Atomic 


Yes 


Yes 


Yes 


Yes 


No 


No 








Meaning 
Match the tab character (HT) 
Titlecase the next character only 
Uppercase (not titlecase) till \E 
Match any "word" character (alphanumerics plus _ ) 
Match any nonword character 
Match the character given in hexadecimal 
True at end of string only 


True at end of string or before optional newline 


A.16.5 Extending Regular-Expression Sequences 


Table A-4 includes several useful features that have been added to Perl's regular-expression 


Table A-4. Extended regular-expression sequences 


capabilities. 
Extension 
(?#...) 
C2 besa) 


(?imsx-imsx) 


(?imsx-imsx:...) 


No Comment, discard 

Yes Cluster-only parentheses, no capturing 
No Enable/disable pattern modifiers 

Yes Cluster-only parentheses plus modifiers 














(?=...) True if lookahead assertion succeeds 
(?!...) True if lookahead assertion fails 
(?<=...) No True if lookbehind assertion succeeds 
(2<!...) No True if lookbehind assertion fails 
(?>...) Yes Match nonbacktracking subpattern 
(2{...}) Execute embedded Perl code 

(224... }) Match regex from embedded Perl code 
C2 (sea) otelless) Yes Match with if-then-else pattern 
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Table A-4. Extended regular-expression sequences 





A.16.6 Pattern Modifiers 


Pattern modifiers are single-letter commands placed after the forward slashes. They delimit a 
regular expression or a substitution and change the behavior of some regular-expression 
features. Table A-5 lists the most common pattern modifiers, followed by an example. 


Table A-5. Pattern modifiers 


Modifier 


: 
: 
; 


/o Compile pattern once only 








/g Find all matches, not just the first one 








As an example, say you were looking for a name in text, but you didn't know if the name had 
an initial capital letter or was all capitalized. You can use the /i modifier, like so: 


Stext = "WATSON and CRICK won the Nobel Prize"; 
Stext =~ /Watson/i; 
print $&; 


This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints out 
the matched string WATSON. 
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A.17 Scalar and List Context 


Every operation in Perl is evaluated in either scalar or list context. Many operators behave 
differently depending on the context they are in, returning a list in list context and a scalar in 
scalar context. 


The simplest example of scalar and list contexts is the assignment statement. If the left side 
(the variable being assigned a value) is a scalar variable, the right side (the values being 
assigned) are evaluated in scalar context. In the following examples, the right side is an array 
@array of two elements. When the left side is a scalar variable, it causes @array to be 
evaluated in scalar context. In scalar context, an array returns the number of elements in an 
array: 


@array = ('one', 'two'); 
Sa = @array; 

print Say 

This prints: 

2 


If you put parentheses around the $a, you make it a list with one element, which causes 
@array to be evaluated in list context: 


@array = ('one', 'two'); 
(Sa) = @array; 

print $a; 

This prints: 

one 


Notice that when assigning to a list, if there are not enough variables for all the values, the 
extra values are simply discarded. To capture all the variables, you'd do this: 


@array = ('one', 'two'); 
(Sa, Sb) = @array; 

print "Sa Sb"; 

This prints: 

one two 


Similarly, if you have too many variables on the left for the number of right variables, the 
extra variables are assigned the undefined value undef. 


When reading about Perl functions and operations, notice what the documentation has to say 
about scalar and list context. Very often, if your program is behaving strangely, it's because it 
is evaluating in a different context than you had thought. 


Here are some general guidelines on when to expect scalar or list context: 


e You get list context from function calls (anything in the argument position is evaluated 
in list context) and from list assignments. 


e You get scalar context from string and number operators (arguments to such 
operators as . and + are assumed to be scalars); from boolean tests such as the 
conditional of an if () statement or the arguments to the || logical operator; and 
from scalar assignment. 
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A.18 Subroutines 


Subroutines are defined by the keyword sub, followed by the name of the subroutine, 
followed by a block enclosed by curly braces containing the body of the subroutine. The 
following is a simple example. 


sub a_subroutine { 
print "I'm in a subroutine\n"; 


} 


In general, you can call subroutines using the name of the subroutine followed by a 
parenthesized list of arguments: 


a_subroutine(); 


Arguments can be passed into subroutines as a list of scalars. If any arrays are given as 
arguments, their elements are interpolated into the list of scalars. The subroutine receives all 
scalar values as a list in the special variable @_. This example illustrates a subroutine definition 
and the calling of the subroutine with some arguments: 


sub concatenate _dna { 
my($dnal, $dna2) = @ ; 


my (Sconcatenation) ; 
Sconcatenation = "Sdnal$dna2"; 


return S$concatenation; 


} 


print concatenate dna('AAA', 'CGC'); 
This prints: 
AAACGC 


The arguments 'AAA' and 'CGC' are passed into the subroutine as a list of scalars. The first 
statement in the subroutine's block: 


my($dnal, S$dna2) = @ ; 


assigns this list, available in the special variable @_, to the variables $dnal and $dna2. 


The variables $dnal and Sdna2 are declared as my variables to keep them local to the 
subroutine's block. In general, you declare all variables as my variables; this can be enforced 
by adding the statement use strict; near the beginning of your program. However, it is 
possible to use global variables that are not declared with my, which can be used anywhere in 
a program, including within subroutines. 


The statement: 


my (Sconcatenation) ; 


declares another variable for use by the subroutine. 


After the statement: 


Sconcatenation = "Sdnal$dna2"; 


performs the work of the subroutine, the subroutine defines its value with the return 
statement: 


return Sconcatenation; 


The value returned from a call to a subroutine can be used however you wish; in this 
example, it is given as the argument to the print function. 
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If any arrays are given as arguments, their elements are interpolated into the @_ list, as in the 
following example: 
sub example sub { 

my(@arguments) = @ ; 


print "@arguments\n"; 


} 
my @array = (‘two', “three', ‘four'); 


example sub(‘one', @array, ~“five'); 
This prints: 
one two three four five 


Note that the following attempt to mix arrays and scalars in the arguments to a subroutine 
won't work: 


# This won't work!! 
sub bad_sub { 
my(@array, S$scalar) = @ ; 


print Sscalar; 


} 


my @arr = ('DNA', 'RNA'); 
my Sstring = 'Protein'; 
bad_sub(@arr, $string); 


In this example, the subroutine's variable @array on the left side in the assignment statement 
consumes the entire list on the right side in @_, namely ('DNA’, 'RNA', 'Protein'). The 
subroutine's variable $scalar won't be set, so the subroutine won't print 'Protein' as 
intended. To pass separate arrays and hashes to a subroutine, you need to use references. 
Here's a brief example: 


sub good_sub { 
my (Sarrayref, Shashref) = @ ; 


print "@Sarrayref", "\n"; 


my @keys = keys %Shashref; 


print "@keys", "\n"; 
} 


my @arr = ('DNA', 'RNA'); 
my snums = ( 'one' => 1, 'two!' => 2); 


good_sub(\@arr, \%nums) ; 


This prints: 


4 PREVIOUS 
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A.19 Modules and Packages 


Chapter 1 summarizes the basic concepts of namespaces, packages, and modules. 


A namespace is a table containing the names and values of variables and subroutines. 
Namespaces are excellent ways to protect one part of your code from unintentionally using 
the same variable or subroutine name as appears in another part of your code, causing a 
namespace collision and often leading to incorrect (but hard to identify) program behavior. 


By default, a Perl program uses the namespace called main. 


The package declaration enables you to declare and use different namespaces for different 
parts of your program. For instance, to declare a new namespace called Outer, use the 
following statement: 


package Outer; 


Package declarations usually occur at or near the top of a file and are in effect throughout the 
file, but they can appear several times within a file, causing the active namespace to switch 
each time they are called. 


When a file has one package declaration at the top of the file and it's named with the package 
name followed by the .pm suffix (e.g, Outer.pm), the file is called a Perl module. (The module 
also needs to end with the statement "1;" to load correctly when called.) 


The code for a Perl module can be used in a Perl program by referencing the file defining the 
module with a use statement, as in the following example: 

use Outer; 

The Perl interpreter will then try to find a file called Outer.pm. 

Chapter 1 gives the basic details on how to manage modules so that the Perl interpreter can 
find them. 
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A.20 Object-Oriented Programming 


Object-oriented Perl programming is introduced in Chapter 3 and Chapter 4, and used in all 
the remaining chapters of the book. 





There are three main concepts: 


e A class is a Perl module with a package declaration. The name of the class is the same 
as the name of the package; the module file also has the same name but with ".pm" 
(for Perl module) appended. 


e An object is a reference to a collection of data used by a Perl class, usually but not 
necessarily, implemented as a hash. An object is marked with the name of its class 
using bless. 


e A method is a subroutine in the class. 


Arrow notation -> is used in a special way in object-oriented Perl code and has an effect on 
what arguments are passed to the class methods. 
You use a class in your code by saying, for example: 


use Goodclass; 


You create an object by calling a constructor method in the class, which is usually (but not 
necessarily) called new, for example: 


Sgoodobject = Goodclass->new( parameterl => 1, parameter2 => 2); 


Note that arguments are usually (but not necessarily) specified using the hash notation key 
=> value, and therefore can be given in any order. 


You call other methods in the class by invoking them from a class object. For example you 
call the method stuff in class Goodclass On a Goodclass object $goodobject, with 
argument type initialized to the value 'good', and save its output in the array @goodstuff, like 
so: 


@goodstuff = S$goodobject->stuff (type => 'good'); 


The arrow notation causes the called method to insert an additional argument. (In the 
examples just shown, the methods are the subroutines new and stuff). 


The additional argument is the class information that appears to the left of the arrow. So, in 
these examples, the method new has the argument list: 

('Goodclass', parameterl=1, parameter2=2) 

The method stuff has the argument list: 

(Sgoodobject, type='good') 


The details of how arguments are passed to methods are important to know only if you are 
writing a class, not if you are using one. If you simply use a class, you only have to know how 
to call the class methods using the arrow notation as just shown. 


Inheritance is a technique with which you can write a new class that uses definitions from an 
already existing class. 





AUTOLOAD is a Subroutine that handles calls to undefined methods, and DESTROY is a 
subroutine that handles cleanup when a class object goes out of scope. 


In addition to object methods that operate on individual class objects, you can also define 
class methods that have access to information about multiple class objectsfor example, a 
count of how many objects have been created. 
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A great deal of the utility of existing Perl code is available only in class modules. In this book I 
use some of the important and well-known modules including DBI, CGI, GD and GD: :Graph, 


and the Bioperl collection of modules. Almost all modules are available at 
http://www.CPAN.org. 
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A.21 Built-in Functions 


Perl has many built-in functions. Table A-6 is a partial list with short descriptions. 


Table A-6. Perl built-in functions 


abs VALUE Return the absolute value of its numeric argument 


atan2 Y, X Return the principal value of the arc tangent of Y/X from -x to x 

chdir EXPR Change the working directory to EXPR (or home directory by 
default) 

chmod MODE LIST Change the file permissions of the LIST of files to MODE 





chomp (VARIABLE or 
LIST) 


chop (VARIABLE or LIST) |Remove ending character from string(s) 
chown UID, GID, LIST Change owner and group of LIST of files to numeric UID and GID 
close FILEHANDLE Close the file, socket, or pipe associated with FILEHANDLE 


closedir DIRHANDLE Close the directory associated with DIRHANDLE 


Remove ending newline from string(s), if present 








cos EXPR Return the cosine of the radian number EXPR 


dbmclose HASH Break the binding between a DBM file and a hash 


dbmopen HASH, DBNAME 
MODE 


defined EXPR Return true or false if EXPR has a defined value or not 


delete EXPR Delete an element (or slice) from a hash or an array 









‘| Bind a DBM file to a HASH with permissions given in MODE 





die LIST Exit the program with an error message that includes LIST 





each HASH Step through a hash with one key, or key/value pair, at a time 


exec PATHNAME LIST Terminate the program and execute the program PATHNAME 





with arguments LIST 


exists EXPR Return true if hash key or array index exists 


exit EXPR Exit the program with the return value of EXPR 
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Function 
exp EXPR 
format 


grep EXPR, LIST 


Table A-6. Perl built-in functions 


Return the value of e raised to the exponent EXPR 
Declare a format for use by the write function 


Return list of elements of LIST for which EXPR is true 





gmtime 


goto LABEL 


hex EXPR 


Get Greenwich mean time; Sunday is day 0, January is month 0, 


year is number of years since 1900example: 


($sec,$min,$hour,$mday,$mon,$year,$wday,$yday, 
$isdaylightsavingstime) = gmtime; 


Program control goes to statement marked with LABEL 


Return decimal value of hexadecimal EXPR 








index STR, SUBSTR 
int EXPR 


join EXPR, LIST 


Give the position of the first occurrence of SUBSTR in STR 


Give the integer portion of the number in EXPR 
Join the strings in LIST into a single string, separated by EXPR 

















keys HASH Return a list of all the keys in HASH 

last LABEL Exit the immediately enclosing loop by default, or loop with 
LABEL 

Ic EXPR Return a lowercased copy of string in EXPR 

Icfirst EXPR Return a copy of EXPR with first character lowercased 

length EXPR Return the length in characters of EXPR 

localtime Get local time in same format as in gmtime function 

log EXPR Return natural logarithm of number EXPR 

m/PATTERN/ The match operator for the regular-expression PATTERN, often 


abbreviated as /PATTERN/ 


map BLOCK LIST (or map | Evaluate BLOCK or EXPR for each element of LIST, return list of 


EXPR, LIST) 
mkdir FILENAME 


my EXPR 


next LABEL 


return values 


Create the directory FILENAME 
Localize the variables in EXPR to the enclosing block 


Go to next iteration of enclosing loop by default or to loop 





marked with LABEL 
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Table A-6. Perl built-in functions 


Open a file by associating FILEHANDLE with the file and options 


Function Summary 


oct EXPR Return decimal value of octal value in EXPR 


open FILEHANDLE, EXPR given in EXPR 





opendir DIRHANDLE, EXPR | Open the directory EXPR and assign handle DIRHANDLE 


pop ARRAY Remove and return the last element of ARRAY 





pos SCALAR Give location in string SCALAR where last m//g search left off 





print FILEHANDLE LIST Print LIST of strings to FILEHANDLE (default STDOUT) 


printf FILEHANDLE Print string specified by FORMAT and variables LIST to 
FORMAT, LIST FILEHANDLE 





push ARRAY, LIST Place the elements of LIST at the end of ARRAY 


Give pseudorandom decimal number from 0 to less than EXPR 








rand EXPR (default 1) 
readdir DIRHANDLE Return list of entries of directory DIRHANDLE 
ref EXPR Return true or false if EXPR is a reference or not: if true, returned 


value indicates type of reference 


rename OLDNAME, Change the name of a file 








NEWNAME 
return EXPR Return from the current subroutine with value EXPR 
reverse LIST Give LIST in reverse order, or reverse strings in scalar context 


Like the index function but returns last occurrence of SUBSTR in 


rindex STR, SUBSTR STR 





rmdir FILENAME Delete the directory FILENAME 


s/PATTERN/REPLACEMEN |Replace the match of regular-expression PATTERN with string 
T/ REPLACEMENT 





scalar EXPR Force EXPR to be evaluated in scalar context 


Position the file pointer for FILEHANDLE to OFFSET bytes (if 
WHENCE is O, current position plus OFFSET if WHENCE is 1, or 
OFFSET bytes from the end if WHENCE is 2) 


seek FILEHANDLE, 
OFFSET, WHENCE 
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Table A-6. Perl built-in functions 


Function 
shift ARRAY 
sin EXPR 


sleep EXPR 


Summary 
Remove and return the first element of ARRAY 


Return the sine of the radian number EXPR 


Cause the program to sleep for EXPR seconds 





sort USERSUB LIST (or 
sort BLOCK LIST) 


Sort the LIST according to the order in USERSUB or BLOCK 
(default standard string order) 





splice ARRAY, OFFSET, 
LENGTH, LIST 


split /PATTERN/, EXPR 
sprintf FORMAT, LIST 


sqrt EXPR 


srand EXPR 


Remove LENGTH elements at OFFSET in ARRAY and replace with 
LIST, if present 





Split the string EXPR at occurrences of /PATTERN/, return list 
Return a string formatted as in the printf function 


Return the square root of the number EXPR. 


Set random number seed for rand operator; only needed in 
versions of Perl before 5.004 





stat (FILEHANDLE or 
EXPR) 


study SCALAR 


sub NAME BLOCK 


Return statistics on file EXPR or its FILEHANDLEexample: 


($dev,$inode,$mode,$num_of_links,$uid,$gid,$rdev,$size,$acce 
sstime, $modifiedtime,$changetime,$blksize,$blocks) = stat 
$filename; 





Try to optimize subsequent pattern matches on string SCALAR 


Define a subroutine named NAME with program code in BLOCK 








substr EXPR, OFFSET, 
LENGTH,REPLACEMENT 


system PATHNAME LIST 


tell FILEHANDLE 


tr/ORIGINAL/REPLACEME 


NT/ 


truncate (FILEHANDLE or 


EXPR), LENGTH 


uc EXPR 


Return substring of string EXPR at position OFFSET and length 
LENGTH; the substring is replaced with REPLACEMENT if used 


Execute any program PATHNAME with arguments LIST; returns 
exit status of program, not its output; to capture ouput, use 
backticksexample: 


@output = '/bin/who'; 





Return current file position in bytes in FILEHANDLE 


Transliterates each character in ORIGINAL with corresponding 
character in REPLACEMENT 





Shorten file EXPR or opened with FILEHANDLE to LENGTH bytes 





Return uppercased version of string EXPR 


Page 362 


Function 


ucfirst EXPR 


undef EXPR 


unlink LIST 


unshift ARRAY, LIST 


Table A-6. Perl built-in functions 


Return the undefined value; if a defined variable or subroutine 


Summary 


Return string EXPR with first character capitalized 


EXPR is given, it's no longer defined; it can be assigned a value 
when you don't need to save the value 





Delete the LIST of files 


Add LIST elements to the beginning of ARRAY 





use MODULE 


values HASH 


wantarray 


warn LIST 


write FILEHANDLE 





Load the MODULE 





Return a list of all values of the HASH 


In a subroutine, return true if calling program expects a list return 





value 


Print error message including LIST 


Write formatted record to FILEHANDLE (default STDOUT) as 
defined by the format function 
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Appendix B. Installing Perl 


Perl is a popular programming language that is extensively used in areas such as 
bioinformatics and web programming. It's become popular with biologists because it is so well 
suited to several bioinformatics tasks. This appendix is geared to those who are attempting 
the language for the first time. 


Perl is also an application and is available (at no cost) to run on all operating systems found in 
the average biology lab (Unix and Linux, Macintosh, Windows, VMS, and more). The Perl 
application on your computer takes a Perl language program (such as one of the programs in 
this book), translates it into instructions the computer can understand, and runs (or 
executes) it. 


The word Perl, then, refers both to the language in which you write programs and to the 
application on your computer that runs those programs. You can always tell from context 
which of these two meanings is being used. 


Every computer language such as Perl needs to have a translator application (an interpreter 
or compiler) that can turn programs into instructions the computer can actually run. The Perl 
application is often referred to as the Perl interpreter, and it includes a Perl compiler as well. 
You will also see Perl programs referred to as Perl scripts or Perl code. 
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B.1 Installing Perl on Your Computer 
Here are the basic steps for installing Perl on your computer: 
1. Check to see if Perl is already installed; if so, check the version. 


2. Get Internet access and go to the Perl home page at http://www.perl.com. 





3. Go to the Downloads page and determine which Perl distribution to download. 
4. Download the Perl distribution. 


5. Install the distribution on your computer. 


B.1.1 Perl May Already Be Installed 


Many computersespecially Unix and Linux computerscome with Perl already installed. (Note 
that Unix and Linux are essentially the same thing, as far as the operating system is 
concerned; Linux is a clone, or functional copy, of a Unix system.) So, first check to see if Perl 
is already there. On Unix and Linux, type the following at a command prompt: 


S perl -v 
If Perl is already installed, you'll see a message something like this: 
This is perl, v5.8.0 built for 1686-linux 


Copyright 1987-2002, Larry Wall 


Perl may be copied only under the terms of either the Artistic License or the 
GNU General Public License, which may be found in the Perl 5 source kit. 


Complete documentation for Perl, including FAQ lists, should be found on 

this system using 'man perl' or 'perldoc perl'. If you have access to the 
Internet, point your browser at http://www.perl.com/, the Perl Home Page. 

If Perl isn't installed, you'll get a message something like this: 

perl: command not found 

If you're on a shared Unix system, at a university or business, check with the system 
administrator if this fails, because although Perl may be installed, your environment may not 
be set to find it. (Or, the system administrator may say, "You need Perl? Okay, I'll install it for 
you!") 


On Windows or Macintosh, look at the program menus, or use the find program to search 
for perl. You can also try typing perl -v at an MS-DOS command window or at a shell 
window on the Mac OS X. (Note that the Mac OS X is a Unix system!) 
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B.2 Versions of Perl 


Perl, like almost all popular software, has gone through much growth and change over the 
course of its 15-year life. The authors, Larry Wall and a large group of cohorts, publish new 
versions periodically. These new versions have been carefully designed to support most 
programs written under old versions, but occasionally some major new features are added 
that simply won't work with older versions of Perl. 


This book assumes you have Perl Version 5 or higher installed. It's likely that if you have Perl 
already installed on your computer, it's Perl 5. But it's best to check. On a Unix or Linux 
system, or from an MS-DOS or Mac OS X command window, the perl -v command just 
shown displays the version number, in my case Version 5.8.0. The number 5.8.0 is "bigger" 
than 5, so I'm okay. If you get a smaller number (very likely 4.036), you'll have to install a 
recent version of Perl to enable the majority of programs in this book to run as shown. 


What about future versions? Perl is always evolving, and Perl Version 6 is on the horizon. Will 
the code in this book still work in Perl 6? The answer is yes. Although Perl 6 is going to add 
some new things to the language, it should have no trouble with the Perl 5 code in this book. 
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B.3 Internet Access 


To install Perl, you will need to download it from the Internet, as shown in the next few pages. 


If you don't have Internet access, you can try taking your computer to a friend who does 
have such access, and connecting long enough to install Perl. You can also use a Zip drive or 
burn a CD from a friend's computer to bring the Perl software to your computer. There are 
also commercial shrink-wrapped CDs of Perl available from several sources (ask at your local 
software store), and several books include CDs with Perl. 
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B.4 Downloading 


Following are some directions for getting Perl to work on your Unix or Linux, Macintosh, and 
Windows computers. (It is also available for several other less common operating systems 
from the Perl web site.) 


There is one web site that serves as a central jumping off point for all things Perl: 
http://www.perl.com. This web page has a Downloads clickable button that guides you to 
everything you need to install Perl on your computer. 





The more specific directions that follow are up-to-date as of this writing, but you should be 
aware that the web pages they refer to may change their design. So, basically, go to 

http ://www.perl.com, and look for the Downloads link to click on. The system should tell you 
everything you need to know after that. There will be a HELP link and other helpful links 
besides. So even if the information in this section becomes outdated, you will certainly be able 
to visit the main Perl web site and find all you need to install Perl. 





Downloading and installing Perl is usually quite easy; in fact, the majority of the time it's 
perfectly painless, but sometimes you may have to put some effort into getting it to work. If 
you're new at programming, and you run into difficulties, the best thing to do is to ask for 
help from someone who is a professional computer programmer and/or administrator, 
teacher, or someone in the lab who already programs in Perl. But the chances are that you 
won't need any help: keep reading! 


B.4.1 Binary Versus Source Code 


One choice you will find when downloading is the choice between binary or source code 
distributions of Perl. The chances are very good that a binary version will be available for your 
computer. Get that if it's available for your particular system. 


Recall that a binary (or executable or compiled program) is a program that has been 
translated into machine language and is ready to run. Source code is a program written in a 
language such as C or Perl, which can then be compiled or interpreted to become a binary. 
The Perl application is actually written in the C programming language, so it's also possible to 
get the C source code for the Perl application and compile it to create the Perl application 
binary. 


The best choice for installing Perl on your computer is usually to get an already made binary 
version of the program, because nothing else needs to be done. However, if no binary is 
available, or if you want to control the various options of your Perl installation, you can get 
the source code for Perl, which is itself written in the C programming language. You then 
compile it using a C compiler. But I repeat: see if you can find a binary for your particular 
computer operating system; compiling from source code is more complicated for beginners! 
Details are available at the Perl web site. 


B.4.2 Perl for Unix and Linux 


Recall that Unix and Linux are essentially the same kind of operating systemLinux is a clone of 
Unix. Both Unix and Linux come in several variants offered by various companies. 


Perl was originally developed on Unix and for quite some time now it has come already 
installed on most such systems. Open a window and type per! -v. If you get version 
information, Perl is there. 


If Perl isn't installed on your Unix or Linux machine, first try to find a binary to install. Go to 
the Downloads page of http://www.perl.com. You'll see the subheading Binary Distributions. 
Select Unix or Linux, and then see if your particular flavor of system has a binary available. 
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Several versions of Unix and Linux are available here, and the instructions available at the web 
site should be enough for you to get Perl installed once you've downloaded the binary. Most 
versions of Linux maintain up-to-date Perl binaries on their web sites. For instance, if you 
have a Redhat Linux system, you need to identify which version of the system you have (by 
typing uname -a); then get the appropriate rpm file to download and install. Redhat has an rpm 
for Perl that Redhat Linux users can install by typing: 


rpm -Uvh perl.rpm 


Note that the actual name of the per1.rpn file varies. 


If no binary version of Perl is available for your flavor of Unix or Linux, it is necessary to 
compile Perl from its source code. In this case, starting from the main web page 
http://www.perl.com, you click on the Downloads button and then select Source Code 
Distribution. The source code has an INSTALL file that has instructions that will guide you 
through downloading the source code, installing it on your system, compiling the source code 
into a binary, and finally installing the binary. 





Compiling from source code is a considerably more lengthy process than installing an already 
made binary and requires a bit more reading of instructions, but it usually works quite well. 
Occasionally, however, a problem does arise, and if you are a beginner, I strongly suggest 
that you search out an experienced programmer or system administrator who knows Unix, 
Linux, or Perl, as it's an easy task for the experienced, but can be a bit confusing the first time 
around for the beginner. You need a C compiler on your computer to install Perl from the 
source code. Nowadays, some Unix systems ship without a complete C compiler! Linux will 
always have the free C compiler called gcc installed, and you can also install gcc on any Unix 
(or Windows, or Mac) system that lacks a C compiler. 


B.4.3 Perl for Macintosh 


The MacOS X is a Unix-based system. It ships with Perl installed. But you may be able to find 
more recent versions by looking at the Downloads page, then at Binary Distributions, then at 
Unix or Linux, and go to the Mac OS X link. 


MacPerl is a port of Perl to pre-OS X Macintosh operating systems. The MacPerl installation 
steps are clearly explained on the MacPerl web page, http://www.macperl.com (which you 
can also find from the Perl web page http://www.perl.com and its Downloads button). I'll give 
a brief overview of the process here. 








From the MacPerl page, click on Get MacPerl, and follow the directions to download the 
application. It will appear on your desktop. Double-click it to unstuff it. If you don't have 
Alladin Stuffit Expander (most Macs already do) this won't work, and you'll first have to go to 
http://www.alladinsys.com to download and install Stuffit. 





MacPerl can be installed as a standalone application under the Mac OS Finder or as a tool 
under the Macintosh Programmer's Workbench; most users will want the standalone 
application. Perl Version 5 is available for Mac OS 7.0 and later. More details about which Perl 
version is available for your particular hardware and Mac OS version are available at the 
MacPerl web page. 


B.4.4 Perl for Windows 


Several binaries for different Windows versions are available. Since Windows is closely coupled 
with Intel 32-bit chips, these binaries are often called Wintel or Win32 binaries. The standard 
Perl distribution now is ActivePerl from ActiveState, at the web address 

http://www. activestate.com/ActivePerl/ and the complete installation directions are there on 
the web page. You can also get to that web page via the Downloads button from the Perl 
web site http://www.perl.com. You'll then see the subheading Binary Distributions. Under 
that, click on Win32. Then, click on the ActivePerl site. 








From the ActiveState web site's ActivePerl page, click the Downloads button. Then you can 
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download the Windows-Intel binary. Note that installing it requires a program called Windows 
Installer which is available at that same web page if it's not already on your computer. 
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B.5 How to Run Perl Programs 


The details of how to run Perl vary depending on the operating system of your computer. The 
instructions that come with the version of Perl you install on your computer contain all you 
need to know. I'll just give a short summary here that will point you in the right direction. 


There is a part of the Perl documentation called perlrun. You can find it at 


http://www.perl.com/pub/doc/manual/html/pod/perlrun.html. It gives all the options of 


running Perl, especially on Unix and Linux; it's complete but not for the beginner. 
B.5.1 Running Perl Programs on Unix or Linux 
On Unix or Linux, you usually run Perl programs from the command line. You can run a Perl 
program in a file called this program by typing: 
perl this program 


if you're in the same directory as the program. If you're not in the same directory, you may 
have to give the pathname of the program: 





perl /usr/local/bin/this program 


Usually, you set the first line of this_program to have the correct pathname for Perl on your 
system, since different machines may have installed Perl in different directories. On my 
computer, I use the following as the first line of my Perl programs: 

#!/usr/bin/perl 


You can type which per! to find the pathname where Perl is installed on your system. 


You also usually make the program executable, using the chmod program: 
chmod 755 this program 


If you've set the first line correctly and used chmod, you can just type the name of the Perl 
program to run it. So, if you're in the same directory as the program, you can type 
./this_program or, if the program is in a directory that's included in your $PATH or Spath 
variable, you can type this_program.™ 


"] $PATH is the variable used for the shells sh, bash, and ksh; Spath is the variable used for csh, tcsh, and 
so on. 


If your Perl program won't run, the error messages you get from the shell in the command 
window may be a little confusing. For instance, the bash shell on my Linux system gives the 
error message: 


bash: ./my program: No such file or directory 


in two cases: if there really is no program called my program in the current directory, or if the 
first line of my program has incorrectly given the location of Perl. So watch for that, especially 
when running programs from CPAN that may have different pathnames for Perl embedded in 
their first lines. Also, if you type my_program, you may get the error message: 


bash: my program: command not found 

which means that the operating system can't find the program. But it's right there in your 
directory! The problem is probably that your SPATH or $path variable doesn't include the 
current directory, so that the system isn't even looking in the current directory for the 
program. In this case, change the $PATH or $path variable (depending on which shell you're 
using); or just type ./my_program instead of my_ program. 


B.5.2 Running Perl Programs on the Macintosh 


On Macs, the recommended way to save Perl programs is as "droplets." The MacPerl 
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documentation gives the simple instructions. Basically, you open the Perl program with the 
MacPerl application and then choose Save As and select the Type option Droplet. 


You can drag and drop a file onto a droplet in order to use the file as input (via the @ARGV 
array). 


The new Mac OS X is a Unix system, and on those systems, you have the option of running 
Perl programs from the command line as just described for Unix and Linux systems. 


B.5.3 Running Perl Programs on Windows 


On Windows computers, it's usual to associate the filename extension .p1 with Perl 
programs. This is usually done as part of the installation process, which modifies the registry 
settings to include this file association. You can then launch this _program.p1 by typing 
this_program or perl this_program.p/ in an MS-DOS command window. Windows also has a 
PATH variable specifying folders in which the system looks for programs; this is modified by 
the Perl installation process to include the path to the folder for the Perl application, usually 
c:\perl. If you're trying to run a Perl program that isn't installed in a folder known to the 
PATH variable, you can type the complete pathname to the program, e.g., perl 
c:\windows\desktop\my program.pl. 
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B.6 Finding Help 


Make sure you have any available documentation. If you install Perl as outlined earlier, this is 
already done as part of the general Perl installation, and the instructions that come with your 
Perl distribution will explain how to get at the documentation on your system. There is also 
excellent online Internet documentation, which can be found at http://www.perl.com. 





First, point your web browser at the main Perl web site, bookmark it, and look around a little, 
especially at the available documentation. This is an essential resource. Also, point your 
browser at http://www.ncbi.nlm.nih.gov (the National Center for Biotechnology Information) 
and http://www.ebi.ac.uk/ (the European Bioinformatics Institute) for two of the biggest 
government-sponsored bioinformatics resources. These are among the most important web 
sites for Perl and bioinformatics. 








Also very useful is the standard book Programming Perl (now in its third edition). You can do 
fine with the (free) online documentation, but if you end up doing a lot of Perl programming, 
Programming Perl (and perhaps a few others) will probably end up on your bookshelf. 


Most languages have a standard document set that includes the whole story about the 
language definition and use. In Perl, this is included with the program as the on-line manual. 
Although programming manuals often suffer from poor writing, it is best to be prepared to dig 
into them. A well-honed ability to skim is a great asset. The Perl manual isn't bad; its main 
problem, that it shares with most manuals, is that all the details are in there, so it can be a bit 
overwhelming at first. However, the Perl documentation does a decent job of helping the 
beginner navigate, by means of tutorial documents. 
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Colophon 


Our look is the result of reader comments, our own experimentation, and feedback from 
distribution channels. Distinctive covers complement our distinctive approach to technical 
topics, breathing personality and life into potentially dry subjects. 


The animal on the cover of Mastering Perl for Bioinformatics is a North American bullfrog ( 
Rana catesbeiana). It is native to the central and eastern United States, as well as southern 
portions of Canada. However, this bullfrog has since been introduced as far away from its 
native habitat as Asia, Europe, and Hawaii. 


The North American bullfrog is the largest true frog in North America and can weigh over a 
pound. It can grow up to eight inches in length, although the norm is four to five inches. The 
gender of the bullfrog is ascertained by comparing the size of the external ear (the 
tympanum) relative to the size of the eye. 


Bullfrogs are predators and also are cannibalistic. Their role in the environment is to control 
the population of insect pests as well as snakes and mice. In fact, the zeal of the North 
American bullfrog threatens to drive other frog species to extinction. 


The generic name (rana) comes from the Latin for frog, while the species name (catesbeiana) 
honors an English naturalist. In the 18th century, Mark Catesby (1683-1749) produced the 
authoritative and exhaustive record of the flora and fauna found in the New World. Wealthy 
patrons in England eagerly received Catesby's regular shipments of specimens, including 
plants, birds, reptiles, insects, and frogs. 


Mary Anne Weeks Mayo was the production editor and copyeditor, and Marlowe Shaeffer was 
the proofreader, for Mastering Perl for Bioinformatics. Jane Ellin and Colleen Gorman provided 
quality control. Marlowe Shaeffer, Mary Agner, and James Quill provided production 
assistance. John Bickelhaupt wrote the index. 


Ellie Volckhausen designed the cover of this book, based on a series design by Edie Freedman. 
The cover image is a 19th-century engraving from the Dover Pictorial Archive. Emma Colby 
produced the cover layout with QuarkXPress 4.1 using Adobe's ITC Garamond font. 


David Futato designed the interior layout. This book was converted by Andrew Savikas to 
FrameMaker 5.5.6 with a format conversion tool created by Erik Ray, Jason McIntosh, Neil 
Walls, and Mike Sierra that uses Perl and XML technologies. The text font is Linotype Birka; 
the heading font is Adobe Myriad Condensed; and the code font is LucasFont's TheSans Mono 
Condensed. The illustrations that appear in the book were produced by Robert Romano and 
Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. The tip and warning 
icons were drawn by Christopher Bing. This colophon was written by Reg Aubry. 


The online edition of this book was created by the Safari production group (John Chodacki, 
Becki Maisch, and Madeleine Newell) using a set of Frame-to-XML conversion and cleanup 
tools written and maintained by Erik Ray, Benn Salter, John Chodacki, and Jeff Liggett. 
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& (ampersand) 
&& (logical and) operator 
(bitwise and) operator 
< and > (angle brackets) 
> (arrow operator) 
>> right shift operator 
<> _line input operator 
<<_left shift operator 
* (asterisk) quantifier 
@ (at sign) 
@INC array 
-- (autodecrement operator) 
\ (backslash) 
escaping metacharacters 
metasymbols, use in 
\@ (backslash-at) 
! (bang) 
! (logical negation) operator 
!~ (binding) operator 
#1! (shebang) notation 
!~ (binding operators) 
“ (caret) 


metacharacter in regular expressions 


















































/i (case-insensitive) matching 
:// (colon-slashes) 
{} (curly braces) 2nd 
{3 (curly braces) quantifier 
{} (curly braces) 
dereferencing and 
-w_ flag 
$flag variable 
$ (dollar sign) 
$_ variables 
metacharacter 
. (dot) 
character wildcard 
current directory) 
string operator 
:: (double colons) in module names 
= (equal sign) 
=~ (pattern binding) operator 
=> (syntactic sugar symbol) 
/ (forward slash) 
() (parentheses) 
%genetic_code hash 
% (percent sign) 
++ (autoincrement) operator 
+ (plus sign) quantifier 
? (question mark), in quantifiers 
" (quotes, double) in strings 
' (quotes, single) in strings 
3; (semicolon), ending Perl statements 
#! (shebang notation) 
[_] (square brackets) 
dbh 
| (vertical bar) 
(bitwise OR) operator 
|| Gogical or) operator 
alternation 
@ARGV arrays 
“ (caret) 
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bitwise xor operator 
24-bit color 
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abstraction 
accessor methods 
Gene2.pm 
algorithms [See also string algorithms] 
border conditions and 
data structures and 
references 
alternation 
and operator 
bitwise and (&) operator 
logical and 
control flow, using for 
angle operator 
anonymous arrays 
anonymous data 
anonymous hashes 
anonymous referents 
approximate string matching 
arguments 
arithmetic operators 
arrays 2nd 
@ARGV 
anonymous arrays 
of arrays 
elements, specifying 
matrices 
references to 
sparse arrays 
as subroutine arguments 
two-dimensional arrays 
arrow notation 
arrow operator (\>) 
assignment 
scalar and list context 
assignment operators 
at sign (@) 
attributes 2nd 
graphics output, storing in 
key/value pairs 
attributes (databases) 
autoincrement and autodecrement operators 




































































AUTOLOAD subroutine 
accessors 
arguments 
FilelO.pm 
get and set 
mutators 
speeding up the code 
writing methods using 
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backslash-at (\@) 


base 16 (hexadecimal) numbers 





base 8 (octal) numbers 
base class 





bases, changing to reverse complements 





binary (base 2) numbers 





bind variables 
binding operators 
ie 
bioinformatics 
Bioperl 2nd 
bptutorial.pl 
documentation 2nd 
history 
installing 
modules 
object-oriented style 
objects 
problems 
testing 
test 1 
test 2 
test 3 
test 4 
Bioperl modules 2nd 
bitmaps 
bitwise operators 
& (bitwise and) 
bless function 
blocks 
body tags 
border conditions 
Boutell, Thomas 
bptutorial.pl 
browsers 
built-in functions, Perl 























[ Team LiB ] 


Page 379 


[ Team LiB ] 


[SYMBOL] [A] [B] [€] [DB] [E] [E] [G) (4) [1] (J) [K] [L) (M] [N] LO] [P] [Q] CR) (S$) (1) (YU) (Y) (W) [x] 








candidate keys 
capturing in patterns 
caret (%) 
Carp module 2nd 
case-insensitive matching 
CERN 
CGI (Common Gateway Interface) 2nd 
CGI.pm 2nd 
functions 
using 
programs 
checking syntax 
error logs 
installing 
testing 
writing 
scripts 
webrebasel program 
chaining, logical operators 
character classes 
class data 
class inheritance 2nd 
Class: :Struct 
classes 2nd 
base class 
documentation with POD 
example of a Perl class 
using 
client-server architecture 
clone constructor 
close (system call) 
closures 2nd 
CMYK 
codon2aa subroutine 
colon and forward slashes (://) 
color tables 
colorAllocate method 
command-line 
input files, naming on 
interface to SQL 
commands 
interpretation line 
comments 
Common Gateway Interface [See CGI] 
complex (or imaginary) numbers 
complex data structures 2nd 
dereferencing 
hashes with array values 
printing 
Comprehensive Perl Archive Network [See CPAN] 
computer graphics [See graphics] 
concatenating strings 
conditional statements 
expressions in loops 
connect method 
constructor methods 
FilelO.pm 
constructors 
clone 
Gene.pm 
context 
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control flow 
logical operators, using for 
copying 
arrays 
CPAN (Comprehensive Perl Archive Network) 
create database command 
create table command 
croak function 
croak subroutine 
curly braces ( {}_) 2nd 
dereferencing and 


current directory 
cut sites 
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data compression and graphics file formats 
data redundancy 
data structures 2nd 3rd [See also complex data structures] 
data types 2nd 
Data::Dumper module 
databases 2nd 3rd [See also Perl DBD; Perl DBI; SQL]4th 
administration 
adding users 
backups 
reloading 
create database command 
create table command 
design 
drop command 
insert command 
installation 
popular versions 
relational databases 
stored procedures 
tab-delimited input files 
transactions 
updates 
DBD: :MySQL 
dbh 
DBM files and hashes 
DBMSs (database management systems) 
SQL and 
decimal numbers 
declarative programming 
decrementing variables 
defined and undefined values 
defined function 
dereferencing 
complex data structures 
derived class 
DESTROY subroutine 
die function 
disconnect call 
do-until loops 
do-while loops 
documentation, Perl operators (perlop) 
dollar sign ($) 
dot (.) string operator 
double colons (::) in module names 
downloading Perl 
DPI 
drawmap_ jpg method 
drawmap_ png method 
drawmap_ text method 
drop command 
dump command (databases) 
dynamic programming 
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edit distance 
else statements 
elsif statements 
em(_) function 
email 
encapsulation 2nd 
end_form function 
end_html function 
entity integrity 
entity-relationship modeling 
error messages 

directing to STDERR 
"exclusive-OR" operator (xor) 2nd 
execute method 
exponential notation 























[ Team LiB ] 


Page 383 


[ Team LiB ] 


[SYMBOL] [A] [B] [C] [DB] [E] (F] (G] (H] (1) (3) [kK] [L] (M] [N] [0] [2] [Q] [R] [S) [T] (UY) (Vv) (W] [x] 





false or true value, evaluating with conditionals 








FASTA header 
file test operators 
filehandles 
output 
FileIO.pm 
AUTOLOAD method 
constructor method 
read method 
stat_and localtime functions 














test program 
write method 
files 
directing output to 
graphics formats 2nd 
input from 
named on command line 
opening 
first normal form 
$flag variable 
floating-point numbers 
for loops 
foreach loops 
foreign keys 
format function 
formatting output using printf 
forward slash (/) 
fractions 
FTP. 
functional dependencies 
functions, built-in 
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garbage collection 
gd graphics library 2nd 
GD.pm 2nd 
color table manipulation 
compatible graphics file formats 
installing 
Restrictionmap.pm, adding graphics to 
using 
GD::Graph module 2nd 
gdi.pl program 
gd2.pl program 
Gene.pm 
constructor method 
test program 
Genel.pm 
Gene2.pm 
accessor and mutator methods 
new method 
test program 
Gene3.pm 
AUTOLOAD [See AUTOLOAD subroutine] 
test program 
genetic variability and string matching 
Geneticcode.pm 
get_ 
get_bionetfile method 
get_dbmfile method 
get_graphic method 
get_mode method 
get_recognition_sites method 2nd 
get_regular_expressions method 2nd 
GIF (Graphic Interchange Format) 
Gimp (GNU Image Manipulation Program) 
global variables 
graphics 
applying color 
file formats 2nd 
data compression 
GD compatible 
graphics primitives 
graphs, creating 
methods 
requirements for 
vector graphics 
graphics data, storage in scalar 
graphics output, storing in object attributes 




































































greedy matching 
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hard references 
hashes 2nd 3rd 
%Ygenetic code 
anonymous hashes 
hash keys 
references to 
use in Perl as objects 
with array values 
head tags 
header fields 
hexadecimal (base 16) numbers 
higher dimensional matrices 
home relations 
homologs database 
homologs.getdata program 
homologs.load program 
homologs.tabs program 
hostname 
HTML (Hypertext Markup Language) 
directives 
tags 
web page example 
HTTP (Hypertext Transport Protocol) 
http:// 
hypertext links 
Hypertext Markup Language [See HTML] 
Hypertext Transport Protocol [See HTTP] 
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if statements 


ImageMagick and Image::Magick module 





imaginary numbers 
incrementing variables 
indexing 

scalar values in arrays 
inheritance 2nd 
input 

from files 

named on command line 














STDIN (standard input) 
insert command 
alternatives to 
instance of a class 
integers 2nd 
Internet 
Internet addresses 
IP addresses 
is_ methods 
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JPEG (Joint Photographic Experts Group) 2nd 
outputting data as 
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key/value pairs 2nd 3rd 
keys 

databases 

primary keys 2nd 
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left_shift operator (<<) 

libraries 

line input operator (<>) 

Linux 
compiling Perl from source 
installing Perl binaries on 
Perl programs, running on 

list context 

load utility (SQL) 

local variables 

localtime function 

logical operators, using for control flow 


























loops 
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Macintosh, running Perl programs on 
MacOS X, specifying input files on command line 
mapgraphics attribute 

matrices 

dynamic programming 

higher dimensional matrices 
matrix 
maximal (greedy) matching 
memory management, cleaning up unused objects 
metacharacters 
metasymbols 
methods 2nd 

accessor methods 

arrow notation and 

AUTOLOAD and 

constructor methods 

mutator methods 

Rebase class 

parse _rebase 

minimal matching 
modules 2nd 3rd 

advantages 


Carp module 
colons in module names 


CPAN modules 
defining 
exporting names 
Geneticcode.pm 
storing 
mutators 2nd 
Gene2.pm 
my 
my variables 
MySQL 2nd 
mysql 
MySQL 
multithreading 
Perl DBD driver for 
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named fields 
names, scalar variables 
namespace collisions 
namespaces 
capitalization 
new method 
Genel class 
Gene2.pm 
normal forms 
normalization 
not operator 








numbers 
floating-point 


as scalar values 
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object-oriented (OO) programming 
object-oriented programming 2nd 
objects 2nd 

clearing from memory 

hash data structures 

instance of a class 

representation as hashes 
octal (base 8) numbers 
open system call 
opening files 
Operators 

arithmetic 

assignment 

binding 

bitwise 

context and 

file test 

logical 

conditionals and 

precedence of 

string 
or operator 

| (bitwise OR) 

logical or 

control flow, using for 

output 
































directing to STDOUT, STDERR and files 
functions for 
output, formatting with printf 
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package declaration 
packages 
palettes 
paragraph tags 
param(_) function 
parent class 
parse methods 
parse_rebase method 
parse rebase program 
passing by reference 
passing references to subroutines 
pathnames on the Web 
patterns (and regular expressions) 
binding operators 
metacharacters 
metasymbols 
modifiers 
ercent sign (% 
Perl 
arrays [See arrays] 
assignment 
built-in functions 
command interpretation 
comments 
compiling from source 
conditional statements 
logical operations and 
documentation 
downloading 
finding help 
hashes [See hashes] 
input/output 
installing 
binary vs. source code 
loops 
object-oriented programming 
operators [See operators] 
regular expressions 
running programs 
scalar and list context 
scalar values 
statements 
blocks and 
subroutines 
variables 
scalar 
versions 
Perl DBD (DataBase Dependent) modules 
installing and configuring 
Perl DBI (DataBase Independent) module 2nd 
connect method 
disconnect method 
examples 
execute method 
installing and configuring 
loading the module 
prepare method 
SOL queries 
tab-delimited files, reading in to a database 
PERL5LIB environmental variable 2nd 
perldoc command 
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perlmod 

perlmodlib 

perlop documentation 

picture element 

pixels 

plain old documentation [See POD] 
PNG (Portable Network Graphics) 2nd 











outputting data in 
POD (plain old documentation) 
positions in arrays 
precedence 
logical operators 
Operator 
primary keys 2nd 3rd 
primitives 
print function 
printf function 2nd 
printing complex data structures 
programming 
declarative programming 
design phase 
documentation 
object-oriented (OO) programming 
Perl language, summary 
web programming 
programs, running 
put_ methods 
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uantifiers 
maximal and minimal 


quotes 
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raster images 
rational numbers 


ratios 
read method 
Rebase (Restriction Enzyme Database) 
Rebase dynamic web pages 
Rebase.pm 
attributes 
methods 
parse rebase 
Rebase object, creating 
testing 
RebaseDB.pm 
analysis 
testing program 
recognition sites, mapping 2nd 
ref function 
references 2nd 
to arrays 
to hashes 
passing to subroutines 
to references 
returning from subroutines 
to subroutines 
symbolic vs. hard 
within blocks 
referential integrity 
referents 
anonymous referents 
regular expressions 
relational databases [See databases] 
relational model 
relations 
repeating strings (x operator) 
request 
request method 
response 
Restriction class 
creating 
planning 
restriction enzymes 
restriction maps 
creating 
Restriction.pm 
documentation 


initializing objects 
Restrictionmap.pm 
graphics enhancements 
JPEG output, adding 
Restrictionmap class 
testing 
returning references from subroutines 
reverse complements, changing bases into 
RGB 
right shift operator (>>) 
rows 
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s/// (substitution) operator 
scalar context 
arrays in 
scalar values 
assigning to arrays 
assigning to scalar variables 
numbers 
strings 2nd [See also strings] 
scalar variables 
assigning scalar values to 
scalars 
storing graphics data in 
scheme 
scientific (exponential) notation 
scripts (CGI) 
second normal form 
select command 
SeqgFilelIO.pm 


test program 
SequencelO module 


SequencelO.pm 2nd 
set. 
software reuse 
sparse arrays 
sprintf function 
SQL (Structured Query Language) 2nd 3rd [See also databases] 
commands 
create database 
create table 
drop 
insert 
load utility 
gueries 
bind variables 
select command 
SQL2 
SQL3 
square brackets ([ ]) 
start_multipart_form function 
stat function 
statements 
status line 
STDERR filehandle 
STDIN filehandle 
STDOUT filehandle 
stored procedures 
string algorithms 
strings 
binding operators 
capturing matched patterns in 
formatting (sprintf function) 
matching 
genetic variability and 
Operators 
substituting characters in (tr/// operator) 
subroutines 2nd 
passing data to 
by reference 
passing references to 
references to 
returning references from 
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substitution (s///) operator 
SUPER class 

superclass 

symbol tables 

symbolic references 
syntactic sugar symbol (=>) 
system calls, open and close 
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tab-delimited input files 

SQL load utility and 
tables 

creating 

populating 
tags 
testRebaseDB program 
title tags 
tr/// (transliteration) operator 
transactions 
transliteration (tr///) operator 
true or false value, evaluating with conditionals 
truecolor 
tuples 
two-dimensional arrays 
two-dimensional matrices 
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undefined values 2nd 
unique identifiers 
Unix 
compiling Perl from source 
installing Perl binaries on 
Perl programs, running on 
specifying input files on command line 
unless statements 
update anomalies 
URI::URL modules 
URLs (Uniform Resource Locators) 
use lib directive 2nd 
use strict 
AUTOLOAD, bypassing with 
use strict directive 
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[SYMBOL] [A] [B] [C] [DB] [E] [E) [G) (4) [1] (J) [K] [EL] (M] [NJ LO] [P] [Q] [R] [S} [1] (U] [Mw] [W] [x] 








variables 
scalar 
assigning scalar values to 
testing and changing value in loops 
vector graphics 
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[SYMBOL] [A] [B] [C] [DB] [E] [E] [G) (4) [1] (J) [K] [L) (M] [N] LO] [P] [Q] [R] [S} [1] (U] [Vv] [W] [x] 





warn function 
warnings flag 
Web 
web browsers 
web pages 
building 
directory locations 
example 
web programming 
web servers 
webrebasel 
analysis 
installing 
while loops 
Windows 
Perl programs, running on 
Windows systems 
specifying input files on command line 

















World Wide Web [See Web] 
write function 
write method 

FilelO.pm 
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[SYMBOL] [A] [B] [C] [DB] [E] [E] [G) (4) [2] (J) [K] [L] [(M] [N] LO] [P] [Q] [R] [S} (1) (U) (Vv) (W) [x] 








X operator 
x string operator 


xor operator 
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